Abstract

The pace of progress in the fields of Evolutionary Computation and Machine Learning is
currently limited: in the former field, by the improbability of making advantageous
extensions to evolutionary algorithms when their capacity for adaptation is poorly understood,
and in the latter by the difficulty of finding effective semi-principled reductions of hard
real-world problems to relatively simple optimization problems. In this paper we explain why
a theory which can accurately explain the remarkable capacity for adaptation of the simple
genetic algorithm (SGA) has the potential to address both these limitations. We describe what
we believe to be the impediments, historic and analytic, to the discovery of such a theory
and highlight the negative role that the building block hypothesis (BBH) has played. We
argue based on experimental results that a fundamental limitation which is widely believed
to constrain the SGA's adaptive ability (and is strongly implied by the BBH) is in fact
illusionary and does not exist. The SGA therefore turns out to be more powerful than it is
currently thought to be. We give conditions under which it becomes feasible to numerically
approximate and study the multivariate marginals of the search distribution of an infinite
population SGA over multiple generations even when its genomes are long, and explain
why this analysis is relevant to the riddle of the SGA's remarkable adaptive abilities.

Keywords: genetic algorithm, optimization, adaptation, non-convex, machine learning



1. Introduction

The practice of Machine Learning research can be characterized as the effective
semi-principled reduction of learning problems to problems for which robust and efficient
solution techniques exist, ideally ones with provable bounds on their use of time and space.
In a recent paper Bennett and Parrado-Hernandez (2006) describe the synergistic relationship
between the fields of machine learning (ML) and mathematical programming (MP). They
remark:

"Optimization lies at the heart of machine learning. Most machine learning 
problems reduce to optimization problems. Consider the machine learning ana- 
lyst in action solving a problem for some set of data. The modeler formulates 
the problem by selecting an appropriate family of models and massages the data 
into a format amenable to modeling. Then the model is typically trained by solv- 
ing a core optimization problem that optimizes the variables or parameters of 
the model with respect to the selected loss function and possibly some
regularization function. In the process of model selection and validation, the
core optimization problem may be solved many times. The research area of
mathematical programming theory intersects with machine learning through these
core optimization problems." (Bennett and Parrado-Hernandez, 2006)



Later Bennett and Parrado-Hernandez imply that when the targets of ML reductions have
been optimization problems, they have for the most part been the convex optimization
problems within the MP pantheon:

"Convexity plays a key role in mathematical programming. Convex programs
minimize convex optimization functions subject to convex constraints ensuring
that every local minimum is always a global minimum. In general, convex
problems are much more tractable algorithmically and theoretically. The complexity
of nonconvex problems can grow enormously. General nonconvex programs are
NP-hard." (Bennett and Parrado-Hernandez, 2006)



The close relationship between ML and MP arguably exists because MP provides ML 
with a set of crisp, well-defined problems along with algorithmic solvers that come with 
guarantees on their use of time and space. To state this using metaphors from software 
engineering, the well-defined convex optimization problems are interfaces that MP publishes, 
and the provably efficient and robust algorithmic solvers of MP implement these interfaces. 

Let us differentiate, in this paper, between optimization and adaptation. We define 
optimization as the procurement of one or more points of optimal or close-to-optimal value, 
and adaptation as the generation of points of increasing value over time. Given this def- 
inition, to say that the target problems of Machine Learning reductions are optimization 
problems is to fudge the truth somewhat. While the Mathematical Programming com- 
munity indeed seems to be almost completely concerned with the procurement of optimal 
or close to optimal points, ML researchers aren't interested in optimization per se but in 
the means by which it is achieved in most MP algorithms, i.e. adaptation. In fact opti- 
mization is often prevented in machine learning algorithms — using a "technique" named 
early-stopping — to prevent overfitting. In other words, robust, efficient adaptation is the 
modus operandi of most convex optimization algorithms, and for the most part, it is this 
modus operandi that makes these algorithms interesting to Machine Learning researchers. 

The interface-problems published by the MP community give ML researchers useful 
targets to hit; if an ML researcher works out a semi-principled reduction of a class of learning
problems to one of MP's interface-problems, there are off-the-shelf algorithms within MP 
which allow her to quickly test whether her reduction is effective. 

Because of the emphasis that the ML community places on guarantees of robustness
and efficiency, when the targets of ML reductions have been optimization problems, they
have, for the most part, been restricted to being convex optimization problems within MP.
These problems are rather simple as adaptation problems go: every local optimum is
also a global optimum or, stated differently, there are no merely local optima. Rather heroic
feats of ingenuity are therefore necessary in order to obtain effective semi-principled reductions
of hard problems to these simple optimization problems. The difficulty of obtaining such
reductions is currently a fundamental limitation on the pace of progress within ML.

The simple genetic algorithm (SGA) (Mitchell, 1996) is an adaptation algorithm which
mimics natural sexual evolution. It has been directly applied to a large number of hard
real-world problems and has often succeeded in generating solutions of remarkably high
quality. To be sure,
some amount of thought is required to "massage" these problems into a form which allows 
the SGA to operate successfully on them (e.g. choices must be made about the fitness 
function used and the way solutions are encoded as bitstrings), but unlike the case in 
machine learning this massaging is largely ad-hoc, an outcome more of trial and error than 
principled reasoning. The resulting problems are almost certainly hard ones (non-convex), 
with objective functions that are riddled with local optima. It is a testament to the adaptive 
power of the SGA that it nevertheless often produces solutions of remarkably high quality. 
Given these successes one might expect a great deal of interest in SGAs from the machine 
learning community. That this is not the case speaks to an unfortunate shortcoming of 
GA research. There is no dearth of one-off problems that SGAs have adequately solved. 
However GA researchers have yet to publish a single class of problems such that a) SGAs 
are likely to perform robust, efficient adaptation when applied to problems in this class, and 
b) the class is likely to be useful as the target of ML reductions. For the sake of brevity we 
will loosely define such a class of problems as an SGA-Easy/ML-Useful class. We believe 
that when such problem classes are found the ML community will begin to take a greater 
interest in GA research. The future relationship between the GA and ML communities 
might then be similar to the one that currently exists between MP and ML. 

As mentioned above SGAs commonly adapt high-quality solutions to problems which
almost certainly contain large numbers of local optima. It is reasonable therefore to
suspect that there exists an SGA-Easy/ML-Useful class of hard non-convex problems and 
that the identification of this class will significantly ease the burden of obtaining novel 
ML reductions. We believe that the identification of such a problem class will go hand in 
hand with the discovery of a theory which can give a satisfying explanation of the adaptive 
capacity of the SGA. Such a theory does not currently exist. 



1.1 The Dubious History of the Building Block Hypothesis 



Perceptions of the abilities and limitations of the SGA (and hence the kinds of problems
that it can and cannot solve) have been heavily influenced by a theory of adaptation called
the building block hypothesis (Goldberg, 1989; Mitchell, 1996; Holland, 1975, 2000). This
theory of adaptation has its genesis in the following idea: maybe small groups of closely
located co-adaptive alleles propagate within an evolving population of genomes in much
the same way that single adaptive alleles do in Fisher's theories of sexual evolution (Fisher,
1958). Holland called such groups of alleles building blocks. This idea can be taken one step
further: maybe small groups of co-adaptive building blocks propagate within an evolving 
population of genomes in much the same way that single building blocks do. Such groups 
can be thought of as higher-level building blocks. Pursuing this idea to the fullest extent, 
maybe co-adaptive groups of higher-level building blocks propagate in much the same way 
as ordinary building blocks do to yield building blocks of an even higher level, and so on 
and so forth in hierarchical fashion with the building-blocks of higher levels being comprised 
of co-adaptive groups of lower-level building blocks. Let us call this idea hierarchical
building block assembly. 

Holland (1975) saw in hierarchical building block assembly a way out from the problem
that epistasis (Wolf et al., 2000) poses for Fisher's theory of sexual evolution. He also
believed that hierarchical building block assembly, if implemented efficiently, could serve
as a useful problem solving technique. He argued that a genetic algorithm that he called
a genetic plan can implement hierarchical building block assembly, and moreover does so
efficiently. He offered the genetic plan as a model of natural sexual evolution and also as
a useful technique for finding solutions to adaptation problems with non-convex objective
functions. The main theoretical tool that he used in his argument has come to be called the
schema theorem (Goldberg, 1989; Mitchell, 1996). However neither the schema theorem,
nor any of Holland's other theoretical analyses fully support his claim that simple genetic
algorithms are capable of efficiently implementing hierarchical building block assembly.
Given the boldness of his claim and the large leaps of intuition that Holland makes in order
to support it, the absence of experimental support in (Holland, 1975) is rather conspicuous
(even more so given that simple, computationally unintensive, proof-of-concept experiments
are not difficult to conceive of; see Mitchell et al., 1992, and Forrest and Mitchell, 1993).
It would not have been surprising therefore if the genetic plan had been relegated to the
history books as an algorithm that did not fulfill its raison d'être: to support its inventor's
hunch about the utility of hierarchical building block assembly as a theory of adaptation for
natural sexual evolutionary systems, and to support its inventor's hunch that hierarchical
building block assembly can be efficiently implemented. What seems to have saved the SGA
from this fate is the curious matter of its utility.



In the years following the publication of Holland's seminal work (Holland, 1975),
the SGA was successfully used to adapt high-quality solutions to different sorts of real
world and toy problems with non-convex objective functions. In an unfortunate twist of 
reasoning hierarchical building block assembly became the de-facto explanation for the 
success of the SGA. This explanation came to be called the building block hypothesis. 
Despite its name, the building block hypothesis was treated more as an assumption than as 
a hypothesis. Hierarchical building block assembly had aesthetic appeal, and the building
block hypothesis had Holland's unqualified endorsement (Holland, 1992). Therefore the
building block hypothesis was readily accepted by most within the GA community. Some 
even went so far as to tout the success of SGAs as evidence of the veracity of the building 
block hypothesis or as evidence that hierarchical building block assembly is a useful search 
technique for a wide variety of search problems. Consider the following confused passage 
from one of the first text books on genetic algorithms: 

"...the building block hypothesis has held up in many different problem domains. 
Smooth, unimodal problems, noisy multimodal problems, and combinatorial 
optimization problems have all been attacked successfully using virtually the
same reproduction-crossover-mutation [S]GA." (Goldberg, 1989)



The early support that the building block hypothesis enjoyed accounts for the deep impact 
it has had and continues to have on the course of research in genetic algorithms as well as 
other fields of evolutionary computation such as genetic programming. 

Recently the building block hypothesis has been sharply criticized for lacking
adequate theoretical support. The most forceful criticism that we are aware of has been
levied by Wright et al. (2003): "The various claims about [S]GAs that are traditionally
made under the name of the building block hypothesis have, to date, no basis in
theory, and, in some cases, are simply incoherent". On the empirical side, experimental
results have been obtained which straightforwardly cast doubt upon the ability of
a simple genetic algorithm to efficiently implement hierarchical building block assembly
(Mitchell et al., 1992; Forrest and Mitchell, 1993). In response to these experimental
results a silent transition has occurred within the field of genetic algorithms: hierarchical
building block assembly has gone from being thought of as the abstract process that
SGAs implement to being thought of as a normative process that SGAs mis-implement.
Even though this transition between intellectual positions is completely specious it is now
widely assumed that SGAs work because they manage to "fudge" hierarchical building
block assembly. Many new genetic algorithms have been constructed to compensate for
the perceived shortcomings of the SGA, e.g. messy GA (Goldberg et al., 1989; Goldberg,
1989, 2002), LLGA (Harik and Goldberg, 1997; Goldberg, 2002), CGA (Harik et al., 199?),
ECGA (Harik, 1999), cohort GA (Holland, 2000), FDA (Mühlenbein and Mahnig, 1999),
LFDA (Mühlenbein and Mahnig, 2001), BOA (Pelikan et al., 1999; Goldberg, 2002), hBOA
(Pelikan and Goldberg, 2001), SEAM (Watson, 2001, 2006), etc. The inventors of these
algorithms claim, or at least imply, that their algorithms are better than the SGA at its own
game: hierarchical building block assembly. In many circles within the GA community
the curious matter of the frequent utility of SGAs is now considered closed.

For a case in point of the kind of sleight of hand that we are discussing consider the
following: conceding that there is little evidence that SGAs can efficiently and robustly
implement hierarchical building block assembly, Holland (2000) remarks, "Are [S]GA's,
then, a robust approach to all problems in which building blocks play a key role? By no
means! After years of investigation we still have only limited information about the [S]GA's
capabilities for exploiting building blocks". Later he asserts that "the very essence of good
GA design is retention of diversity, furthering exploration, while exploiting building blocks
already discovered", presents a new genetic algorithm, the Cohort Genetic Algorithm,
and argues that it implements this essence (see (Pei and Goodman, 2001) for evidence that
it does not).

The field of genetic algorithms is both a scientific field and an engineering
domain. Heedful science and meticulous engineering can often work synergistically. However
when the boundary between science and engineering begins to blur, dogma and misplaced
faith can beleaguer the practice of both. To wit, a system that is useful in practice but
does not implement a hypothetical mechanism may receive reduced attention, whereas the
mechanism, far from being dismissed according to the basic norms of science, may become
the holy grail of the engineering goals of the field.

A theory that explains why a system exhibits a particular behavior can influence 
perceptions of how the system can behave, and also of how it cannot. Of the two kinds 
of perceptions, the latter kind is often judged in retrospect to be the greater impediment 
to the discovery of a new theory that can explain and predict the behavior of the system 
with greater accuracy. This is because by influencing perceptions of how the system cannot 
behave a theory implicitly determines the "domain of the impossible" and in doing so it 
steers researchers away from considering certain possibilities. Yet it is precisely amongst
these "impossibilities" that the seeds of a new, more accurate theory often lie.

One of the two goals of this paper is to challenge the widespread belief that the SGA
cannot increase the frequency of a low order schema with above-average fitness when the
defining length of that schema is high (i.e. when the defining bits of that schema are widely
dispersed). This belief can be traced back to Holland's original treatise on genetic algorithms
(Holland, 1975) and goes hand in hand with belief in the building block hypothesis (and
variations thereof). In a later section we provide an argument based on experimental evidence
that this belief is misplaced. We believe that this errant belief will be judged in retrospect
to have been a significant impediment to the discovery of a sound theory of adaptation for
the SGA.
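
To make the terminology concrete: the order of a schema is the number of its defined (non-wildcard) positions, and its defining length is the distance between its outermost defined positions. The following is a minimal sketch, assuming schemata are written as strings over 0, 1 and * (the function names are hypothetical and not taken from the works cited above).

```python
def order(schema: str) -> int:
    """Number of defined (non-'*') positions in the schema."""
    return sum(1 for c in schema if c != "*")


def defining_length(schema: str) -> int:
    """Distance between the first and last defined positions (0 if none are defined)."""
    defined = [i for i, c in enumerate(schema) if c != "*"]
    return defined[-1] - defined[0] if defined else 0


# Two schemata of the same low order but very different defining lengths.
compact = "11******"    # defined bits are adjacent
dispersed = "1******1"  # defined bits are widely dispersed
```

Both example schemata have order 2; the belief challenged in this paper concerns schemata like the second one, whose defined bits lie far apart.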



1.2 A Type of Analysis Which Might Yield an Explanation for the SGA's 
Capacity for Adaptation (And Why This Kind of Analysis is Difficult) 

Our second goal is to present theoretical results which, in our opinion, are likely to advance 
efforts to discover a theory of adaptation for the SGA. 

The search distribution of an SGA in some generation is a particular kind of distri- 
bution over the set of all genomes. This distribution is implicitly determined by the bit- 
strings in the population of the previous generation, their fitness values, the chosen selection 
scheme, and the variation operators of the SGA. The bitstrings in the current generation 
can be thought of as generated by Monte Carlo sampling from this search distribution. The
search distribution is a very useful concept because it gives one a clean way to conceptually 
distinguish between the "deterministic component" and the "stochastic component" of the 
effect of selection and variation in each generation. Another way to say this is that the 
implicit creation of a new search distribution from the individuals in some generation is 
the only part of the adaptation process over which the selection and variation operators 
exert any "deterministic" control. Beyond that, chance is the sole determiner of which in- 
dividuals are actually generated in the next generation. As the size of the population tends 
towards infinity the role that chance plays in determining the composition of populations 
diminishes. In the limit as the population size tends to infinity, chance plays no role at all. 
In this case the search distribution in some generation exactly describes the composition 
of the population in that generation. In this case it is also possible in principle to exactly 
determine the search distribution after any number of generations. We say 'in principle' 
because (as we will soon discuss) this is computationally infeasible in practice when the 
genomes are long.
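
The split between the deterministic and stochastic components can be sketched in code. The following is an illustrative sketch only, assuming a textbook SGA with fitness proportional selection, one-point crossover and per-bit mutation; the function name, the default rates and the representation of genomes as tuples of 0/1 are assumptions made for the example. The selection weights are fixed once the current population is known, while the individuals actually produced depend on the random draws.

```python
import random


def next_generation(population, fitness, p_cross=0.7, p_mut=0.01, rng=random):
    """Sample a new population from the search distribution implied by `population`.

    `fitness` must return a positive number for every genome.
    """
    weights = [fitness(g) for g in population]  # fitness proportional selection
    children = []
    while len(children) < len(population):
        mom, dad = rng.choices(population, weights=weights, k=2)
        if rng.random() < p_cross:  # one-point crossover
            cut = rng.randrange(1, len(mom))
            mom, dad = mom[:cut] + dad[cut:], dad[:cut] + mom[cut:]
        for child in (mom, dad):
            # Flip each bit independently with probability p_mut.
            children.append(tuple(b ^ (rng.random() < p_mut) for b in child))
    return children[: len(population)]


# Usage with a hypothetical fitness function (1 + number of ones).
rng = random.Random(0)
population = [tuple(rng.randint(0, 1) for _ in range(8)) for _ in range(20)]
population = next_generation(population, lambda g: 1 + sum(g), rng=rng)
```

Running the function twice on the same input population generally yields different offspring; only the distribution they are drawn from is the same.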

An infinite population SGA (IPSGA) is a mathematical model of an SGA with a 
large but finite population. Therefore studies that examine how the search distribution 
of an IPSGA changes over multiple generations have the potential to shed light on the 
"deterministic component" of the multi-generational effect of selection and variation on the 
search distribution of an SGA with a large but finite population. Such knowledge may lead 
to a satisfying theory of adaptation for the SGA. 

That said, not all studies of the effects of evolution on the search distribution of an
IPSGA are equally likely to yield theories of adaptation. We believe that studies that
examine the effects of evolution over a small number of generations are more likely to be
useful than those that examine the effects of evolution in the asymptote of time, namely
dynamical systems analyses of the fixed points of IPSGAs (e.g. Vose, 1999). This is for two
reasons. Firstly, an analysis of the fixed points of a dynamical system does not reveal how
or why the system reaches those fixed points. Adaptation in IPSGAs is a transient, not an
asymptotic, phenomenon. Therefore a study of the fixed points of IPSGAs is unlikely to
yield answers about how adaptation occurs. Secondly, recall that an IPSGA is an inexact
mathematical model of an SGA with a large but finite population. The longer the timescale
under consideration, the more likely it is that the search distribution of the IPSGA will
diverge from that of the SGA, even one with a large population.

One promising way to study the changes in the search distribution of an IPSGA over 
multiple generations is to study the multivariate marginals of this changing distribution. 
For an arbitrary IPSGA this becomes computationally infeasible very quickly as the length 
of the bitstring genomes increases (for reasons we will soon explain). In this paper we 
will derive conditions under which such a study is feasible for genomes of arbitrary length.
Stating the above in the language of schemata (Holland, 1975; Goldberg, 1989; Mitchell,
1996), we will derive conditions under which it becomes computationally feasible to track the
frequencies of schemata in a schema family (Wright et al., 2003) over multiple generations
even when the bitstring genomes of the IPSGA are long.

Let us briefly review why calculating schema frequencies of an IPSGA over multiple
generations is computationally infeasible for long genomes. An IPSGA with genomes of
length $\ell$ can be modeled by a set of $2^\ell$ coupled difference equations. For each genome in
the search space there is a corresponding state variable which gives the frequency of the
genome in the population, and a corresponding difference equation which describes how
the value of that state variable in some generation can be calculated from the values of
the state variables in the previous generation. A naive way to calculate the frequency
of some schema over multiple generations is to numerically iterate the IPSGA over many
generations, and in each generation, to sum the frequencies of all the genomes that belong
to the schema. The simulation of one generation of an IPSGA with a genome set of size
$N$ has time complexity $O(N^3)$, and an IPSGA with bitstring genomes of length $\ell$ has a
genome set of size $N = 2^\ell$. Hence, the time complexity for a numeric simulation of one
generation of an IPSGA is $O(8^\ell)$. (See (Vose, 1999, p. 36) for a description of how the Fast
Walsh Transform can be used to bring this bound down to $O(3^\ell)$.) Even when the Fast
Walsh Transform is used, computation time still increases exponentially with $\ell$. Therefore
for large $\ell$ the naive way of calculating the frequencies of schemata over multiple generations
clearly becomes computationally infeasible (see footnote 1).
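
The summation step of the naive approach, and the growth of the state space, can be illustrated as follows. This is a sketch under stated assumptions: genomes are tuples of bits, schemata are strings over 0, 1 and *, and the function names are hypothetical. It does not implement the IPSGA difference equations themselves.

```python
from itertools import product


def matches(genome, schema):
    """True if the bit tuple `genome` belongs to `schema` (a string over 0, 1, *)."""
    return all(s == "*" or int(s) == g for g, s in zip(genome, schema))


def schema_frequency(genome_freqs, schema):
    """Sum the frequencies of every genome in the schema: work linear in 2**ell."""
    return sum(p for g, p in genome_freqs.items() if matches(g, schema))


ell = 4
genomes = list(product((0, 1), repeat=ell))
uniform = {g: 1 / len(genomes) for g in genomes}  # one state variable per genome

print(schema_frequency(uniform, "1**0"))  # 0.25 under the uniform distribution

# Growth of the state space and of the per-generation cost bounds with genome length.
for n_bits in (10, 20, 30):
    print(n_bits, 2 ** n_bits, 8 ** n_bits, 3 ** n_bits)
```

Even the cheaper $3^\ell$ bound exceeds $10^{14}$ at $\ell = 30$, which is why a method whose cost does not depend on genome length is needed.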

Holland's schema theorem (Holland, 1975; Goldberg, 1989; Mitchell, 1996) was the
first theoretical result which allowed one to calculate (albeit imprecisely) the frequencies
of schemata after a single generation. The crossover and mutation operators of a GA can
be thought of as destroying some schemata and constructing others. Holland only considered
the destructive effects of these operators. His theorem was therefore an inequality. Later work
(Stephens and Waelbroeck, 199?) contained a theoretical result which gives exact values
for the schema frequencies after a single generation. Unfortunately for IPSGAs with long
bitstrings this result does not straightforwardly suggest conditions under which schema
frequencies can be numerically calculated over multiple generations in a computationally
feasible way.
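
For reference, the schema theorem is commonly stated in textbooks in the following form, for a GA with fitness proportional selection, one-point crossover and bitwise mutation. The notation is introduced here for illustration and is not defined elsewhere in this section: $m(H, t)$ is the number of instances of schema $H$ in generation $t$, $f(H, t)$ is the mean fitness of those instances, $\bar{f}(t)$ is the mean fitness of the population, $\delta(H)$ and $o(H)$ are the defining length and order of $H$, $\ell$ is the genome length, and $p_c$ and $p_m$ are the crossover and mutation rates.

```latex
\mathbb{E}\left[ m(H, t+1) \right] \;\ge\; m(H, t)\,\frac{f(H, t)}{\bar{f}(t)}
\left[ 1 - p_c\,\frac{\delta(H)}{\ell - 1} - o(H)\,p_m \right]
```

The bracketed factor accounts only for the disruption of $H$ by crossover and mutation, which is why the statement is an inequality. The term $p_c\,\delta(H)/(\ell - 1)$ is where the defining length of the schema enters the bound.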



1. Vose reported in 1999 that computational concerns force numeric simulation to be limited to cases where
$\ell < 20$.

1.3 The Promise of Coarse-Graining

Coarse-graining is a technique that has widely been used to study aggregate properties (e.g.
temperature) of many-body systems with very large numbers of state variables (e.g. gases). 
This technique allows one to reduce some system of difference or differential equations 
with many state variables (called the fine-grained system) to a new system of difference or 
differential equations that describes the time-evolution of a smaller set of state variables 
(the coarse-grained system). The state variables of the fine-grained system are called the 
microscopic variables and those of the coarse-grained system are called the macroscopic 
variables. The reduction is done using a surjective non-injective function between the 
microscopic state space and the macroscopic state space. This function is called the partition 
function. States in the microscopic state space that share some key property (e.g. energy) 
are projected to a single state in the macroscopic state space. The reduction is therefore 
'lossy', i.e. information about the original system is typically lost. Metaphorically speaking, 
just as a stationary light bulb projects the shadow of some moving 3D object onto a flat 
2D wall, the partition function projects the changing state of the fine-grained system onto 
states in the state space of the coarse-grained system. 
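
As a small illustration of a partition function in the setting of this paper, the sketch below projects a distribution over bitstring genomes (the microscopic state) onto the frequencies of the schemata defined on a chosen set of loci (the macroscopic state). The function name, the choice of loci and the example distribution are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def project(genome_freqs, loci):
    """Map a distribution over genomes to the frequencies of the schemata
    defined on `loci`, i.e. to the marginal distribution over those loci."""
    macro = defaultdict(float)
    for genome, p in genome_freqs.items():
        macro[tuple(genome[i] for i in loci)] += p  # many genomes share one key
    return dict(macro)


ell = 3
genomes = list(product((0, 1), repeat=ell))
# Hypothetical fine-grained state: frequency proportional to 1 + number of ones.
total = sum(1 + sum(g) for g in genomes)
micro = {g: (1 + sum(g)) / total for g in genomes}

macro = project(micro, loci=(0, 2))
print(len(micro), "microscopic variables ->", len(macro), "macroscopic variables")
```

The projection is lossy in the sense described above: the frequencies of the individual genomes cannot be recovered from the four macroscopic values.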

The term 'coarse-graining' has been used in the Evolutionary Computation literature 
to describe different sorts of reductions of the equations of an IPSGA. Therefore we now 
clarify the sense in which we use this term. In this paper a reduction of a system of 
equations must satisfy three conditions to be called a coarse-graining. Firstly, the number of
macroscopic variables should be smaller than the number of microscopic variables. Secondly,
the new system of equations must be completely self-contained in the sense that the state-
variables in the new system of equations must not be dependent on the microscopic variables.
Thirdly, the dynamics of the new system of equations must 'shadow' the dynamics described
by the original system of equations in the sense that if the projected state of the original
system at time $t = 0$ is equal to the state of the new system at time $t = 0$ then at any other
time $t$, the projected state of the original system should be closely approximated by the
state of the new system. If the approximation is instead an equality then the reduction is
said to be an exact coarse-graining. Most coarse-grainings are not exact. This specification
of coarse-graining is consistent with the way this term is typically used in the scientific
literature. It is also similar to the definition of coarse-graining given in (Rowe et al., 2006)
(the one difference being that in our specification a coarse-graining is assumed not to be
exact unless otherwise stated).
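
The three conditions can be checked numerically in a deliberately simple case. The sketch below is not the result derived in this paper: it assumes dynamics with no selection and no recombination, only independent bitwise mutation at a hypothetical rate `mu`, for which the marginal over any set of loci evolves on its own. The reduced system has fewer variables, is iterated without reference to the full system, and shadows it exactly.

```python
from itertools import product


def mutate(freqs, mu):
    """One generation of bitwise mutation applied to a distribution over bit tuples."""
    new = {}
    for z in freqs:
        total = 0.0
        for x, p in freqs.items():
            d = sum(a != b for a, b in zip(x, z))  # Hamming distance
            total += p * mu ** d * (1 - mu) ** (len(z) - d)
        new[z] = total
    return new


def project(freqs, loci):
    """Marginalize a distribution over genomes onto the chosen loci."""
    out = {}
    for g, p in freqs.items():
        key = tuple(g[i] for i in loci)
        out[key] = out.get(key, 0.0) + p
    return out


ell, loci, mu = 5, (0, 4), 0.05
genomes = list(product((0, 1), repeat=ell))
norm = sum(1 + sum(g) for g in genomes)
x = {g: (1 + sum(g)) / norm for g in genomes}  # fine-grained state (32 variables)
y = project(x, loci)                           # coarse-grained state (4 variables)

for t in range(10):
    x = mutate(x, mu)  # iterate the original system
    y = mutate(y, mu)  # iterate the reduced system on its own
    gap = max(abs(project(x, loci)[k] - y[k]) for k in y)
    assert gap < 1e-12  # the reduced system shadows the original one
```

In this toy case the cost of iterating `y` depends only on the number of chosen loci, not on `ell`, which is the property that makes coarse-graining attractive for long genomes.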

Suppose the vector of state variables $\mathbf{x}^{(t)}$ is the state of some system at time $t$ and
the vector of state variables $\mathbf{y}^{(t)}$ is the state of a coarse-grained system at time $t$. Now,
if the partition function projects $\mathbf{x}^{(0)}$ to $\mathbf{y}^{(0)}$ then, since none of the state variables of
the original system are needed to express the dynamics of the coarse-grained system, one
can determine how the state of the coarse-grained system $\mathbf{y}^{(t)}$ (the shadow state) changes
over time without needing to determine how the state in the fine-grained system $\mathbf{x}^{(t)}$ (the
shadowed state) changes. Thus, even though for any $t$, one might not be able to determine
$\mathbf{x}^{(t)}$, one can always be confident that $\mathbf{y}^{(t)}$ is its projection. Therefore, if the number of
state variables of the coarse-grained system is small enough, one can numerically iterate
the dynamics of the (shadow) state vector $\mathbf{y}^{(t)}$ without needing to determine the dynamics
of the (shadowed) state vector $\mathbf{x}^{(t)}$.

In this paper we give sufficient conditions under which it is possible to coarse-grain the
dynamics of an IPSGA such that the macroscopic variables are the frequencies of the family 
of schemata in some schema partition. If the size of this family is small then, regardless of 
the length of the genome, one can use the coarse-graining result to numerically calculate the 
approximate frequencies of these schemata over multiple generations in a computationally 
tractable way. Given some population of bitstring genomes, the set of frequencies of a 
family of schemata describe the multivariate marginal distribution of the population over 
the defined loci of the schemata. Thus another way to state our contribution is that we 
give sufficient conditions under which the multivariate marginal distribution of an evolving 
population over a small number of loci can be numerically approximated over multiple 
generations regardless of the length of the genomes. 
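
The correspondence between schema family frequencies and a multivariate marginal can be seen on a finite population. This is a minimal sketch, assuming genomes are given as bitstrings; the function name and the example population are hypothetical.

```python
from collections import Counter


def schema_family_frequencies(population, loci):
    """Frequencies of the schemata that are defined exactly on `loci`.

    Each key is the tuple of bits found at `loci`; together the values form the
    multivariate marginal distribution of the population over those loci.
    """
    counts = Counter(tuple(g[i] for i in loci) for g in population)
    return {bits: n / len(population) for bits, n in counts.items()}


population = ["10110", "00111", "10010", "11100"]  # hypothetical bitstring genomes
marginal = schema_family_frequencies(population, loci=(0, 3))
```

Here the entry for the bits ('1', '1') is the frequency of the schema 1**1*, which two of the four genomes match.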

We stress that our use of the term 'coarse-graining' differs from the way this term has
been used in other publications within the evolutionary computation literature. For instance
in (Stephens and Zamora, 2003) the term 'coarse-graining' is used to describe a reduction
of the IPSGA equations such that each equation in the new system is similar in form to the
equations in the original system. However the state variables in the new system are defined
in terms of the state variables in the original system. Therefore a numerical iteration of
the new system is only computationally tractable when the length of the genomes is
relatively short. Elsewhere the term coarse-graining has been inconsistently used to refer to
"a collection of subsets of the search space that covers the search space" (Contreras et al.,
2003), and as "just a function from a genotype set to some other set" (Burjorjee and Pollack,
2006).



1.3.1 A Previous Coarse-Graining Result 

Wright et al. (2003) have shown that the frequency dynamics of the genomes of a
non-selective IPSGA can be coarse-grained such that the macroscopic variables are the
frequencies of a family of schemata. However they argue that the dynamics of a regular
selecto-mutato-recombinative IPSGA cannot be similarly coarse-grained "except in the trivial
case where fitness is a constant for each schema in a schema family" (Wright et al., 2003). Let
us call this condition schematic fitness invariance. Wright et al. imply that it is so severe
that it renders the coarse-graining result essentially useless.

This negative result holds true when there is no constraint on the initial population. 
In this paper we show that if we constrain the class of initial populations then it is possible 
to obtain a similar coarse-graining under a much weaker constraint on the fitness function. 
The constraint on the class of initial populations that is required for our coarse-graining is 
not onerous; it is easily met by a population that is uniformly distributed over the genome 
set. 

1.4 Structure of this Paper 

The rest of this paper is organized as follows: in the next section we define the basic 
mathematical objects and notation which we use to model the dynamics of an infinite 
population evolutionary algorithm (IPEA). This framework is very general; we make no 
commitment to the data-structure of the genomes, the nature of mutation, the nature 
of recombination, or the number of parents involved in a recombination. We do however
require that selection be fitness proportional. In section 3 we define the concepts of
semi-coarsenablity, coarsenablity and global coarsenablity which allow us to formalize a
useful class of exact coarse-grainings. In sections 4 and 5 we prove some stepping-stone
results about selection and variation. We use these results in section 6 where we prove that
an IPEA that satisfies certain abstract conditions can be coarse-grained. The proofs in
sections 5 and 6 rely on lemmas which have been relegated to and proved in the appendix.
As stated above, the theoretical results obtained in sections 4 to 6 are very general, and are
applicable to any IPEA which meets the premises of the theorems in those sections. In
sections 7 and 8 we develop theoretical machinery which allows us to apply the results
of these sections to an IPSGA. In section 9 we specify concrete conditions under which
IPSGAs with long genomes and non-trivial fitness functions can be coarse-grained such
that the macroscopic variables are the frequencies of a family of schemata and the fidelity
of the coarse-graining is likely to be high. Our argument is informal and requires the
reader to make small leaps of intuition. In section 10 we explain why a direct experimental
validation of the results of section 9 is computationally infeasible. We then make certain
key assumptions and modeling decisions which allow us to indirectly validate these results.
In section 11 we use the uncontroversial modeling decision that we make in section 10 to
experimentally show that an SGA is capable of increasing the frequency of a low-order 
schema with higher than average fitness, even when the defining length of that schema 
is high. Our experiments show that the widespread belief that an SGA cannot do such 
a thing is misplaced. In our conclusion we reiterate the importance of obtaining a well- 
founded theory which explains the adaptive capacity of the SGA and specify the concrete 
contributions that we have made towards this goal. 



2. Mathematical Preliminaries 



Let $X$, $Y$ be sets and let $\beta : X \to Y$ be some function. For any $y \in Y$ we use the
notation $\langle y \rangle_\beta$ to denote the pre-image of $y$, i.e. the set
$\{x \in X \mid \beta(x) = y\}$. For any subset $A \subseteq X$ we use the notation $\beta(A)$ to
denote the set $\{y \mid \beta(a) = y \text{ and } a \in A\}$.

As in (Toussaint, 2003a), for any set $X$ we use the notation $\Lambda^X$ to denote the set
of all distributions over $X$, i.e. $\Lambda^X$ denotes the set
$\{f : X \to [0,1] \mid \sum_{x \in X} f(x) = 1\}$. For any set $X$, let $0^X : X \to \{0\}$ be the
constant zero function over $X$. For any set $X$, an $m$-parent transmission function
(Slatkin, 1970; Altenberg, 1994; Toussaint, 2003b) over $X$ is an element of the set

$$\Big\{\, T : \prod_{1}^{m+1} X \to [0,1] \;\Big|\; \forall x_1, \ldots, x_m \in X,\ \sum_{x \in X} T(x, x_1, \ldots, x_m) = 1 \Big\}$$

Extending the notation introduced above, we denote this set by $\Lambda^X_m$. Following
(Toussaint, 2003), we use conditional probability notation in our denotation of
transmission functions. Thus an $m$-parent transmission function $T(x, x_1, \ldots, x_m)$ is denoted
$T(x \mid x_1, \ldots, x_m)$.

A transmission function can be used to model the individual-level effect of mutation, 
which operates on one parent and produces one child, and indeed the individual-level effect 
of any variation operation which operates on any number of parents and produces one
child.

For some genome set $G$, let $T_1 \in \Lambda^G_m$ and $T_2 \in \Lambda^G_n$ be transmission functions that
model $m$ and $n$-parent variation operations respectively. Then the effect of applying these
two operations one after the other can be modeled by a third transmission function: suppose
the $m$-parent variation operation is applied to the output of the $n$-parent operation, then the
transmission function that models the effect of the composite variation is given by $T_1 \circ T_2$,
where composition of transmission functions is defined as follows:

Definition 1 (Composition of Transmission Functions) For any $T_1 \in \Lambda^X_m$, $T_2 \in \Lambda^X_n$,
we define the composition of $T_1$ with $T_2$, denoted $T_1 \circ T_2$, to be a transmission function in
$\Lambda^X_n$ given by

$$(T_1 \circ T_2)(x \mid y_1, \ldots, y_n) = \sum_{(x_1, \ldots, x_m) \in \prod_1^m X} T_1(x \mid x_1, \ldots, x_m) \prod_{i=1}^{m} T_2(x_i \mid y_1, \ldots, y_n)$$
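The composition in Definition 1 can be sketched numerically. The sketch below assumes a finite set $X$ whose elements are indexed $0, \ldots, N-1$ and stores a transmission function as a NumPy array `T[x, x1, ..., xm]`; this array layout and the names `compose`, `M` and `C` are illustrative assumptions, not part of the paper.

```python
import itertools
import numpy as np

def compose(T1, T2):
    """Compose transmission functions: returns T1 o T2 (Definition 1).

    T1 has shape (N,)*(m+1), with T1[x, x1, ..., xm] = T1(x | x1, ..., xm).
    T2 has shape (N,)*(n+1) in the same layout. The result has T2's shape.
    """
    N, m, n = T1.shape[0], T1.ndim - 1, T2.ndim - 1
    out = np.zeros(T2.shape, dtype=float)
    for xs in itertools.product(range(N), repeat=m):
        # weight[y1, ..., yn] = prod_i T2(x_i | y1, ..., yn)
        weight = np.ones(T2.shape[1:])
        for xi in xs:
            weight = weight * T2[xi]
        # add T1(x | x1, ..., xm) * weight for every child x
        child_probs = T1[(slice(None),) + xs].reshape((N,) + (1,) * n)
        out += child_probs * weight
    return out

# One-locus example: bit-flip mutation M applied to the output of crossover C.
M = np.array([[0.9, 0.1],
              [0.1, 0.9]])              # M[x, y] = M(x | y)
C = np.zeros((2, 2, 2))                 # C[x, y, z] = C(x | y, z)
for y, z in itertools.product(range(2), repeat=2):
    C[y, y, z] += 0.5                   # child copies the first parent
    C[z, y, z] += 0.5                   # child copies the second parent
MC = compose(M, C)                      # a two-parent transmission function
```

Here `MC` models crossover followed by mutation; for every pair of parents its entries over the child axis sum to one, as a transmission function requires.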



Our scheme for modeling EA dynamics is based on the one used in (Toussaint, 2003a).



We model the genomic populations of an EA as distributions over the genome set. The 
population-level effect of the evolutionary operations of an EA is modeled by mathematical 
operators whose inputs and outputs are such distributions. 

The expectation operator, defined below, is used in the definition of the selection 
operator, which follows thereafter. 

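Definition 2 (Expectation Operator) Let $X$ be some finite set, and let $f : X \to \mathbb{R}^+$
be some function. We define the expectation operator
$\mathcal{E}_f : \Lambda^X \cup 0^X \to \mathbb{R}^+ \cup \{0\}$ as follows:

$$\mathcal{E}_f\, p = \sum_{x \in X} f(x)\, p(x)$$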

The selection operator is parameterized by a fitness function. It models the effect of 
fitness proportional selection on a population of genomes. 

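Definition 3 (Selection Operator) Let $X$ be some finite set and let $f : X \to \mathbb{R}^+$ be
some function. We define the Selection Operator $\mathcal{S}_f : \Lambda^X \to \Lambda^X$ as follows:

$$(\mathcal{S}_f\, p)(x) = \frac{f(x)\, p(x)}{\mathcal{E}_f\, p}$$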
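As a minimal sketch of the expectation and selection operators, assume a finite genome set indexed $0, \ldots, N-1$, with a distribution and a fitness function each stored as a length-$N$ NumPy array. The function names `expectation` and `select` are illustrative choices.

```python
import numpy as np

def expectation(f, p):
    """E_f p: expected value of f under the distribution p."""
    return float(np.dot(f, p))

def select(f, p):
    """S_f p: fitness proportional selection applied to p."""
    return f * p / expectation(f, p)

f = np.array([1.0, 1.0, 2.0, 4.0])     # fitness of each of four genomes
p = np.full(4, 0.25)                   # uniform population
q = select(f, p)                       # population after selection
```

After selection each genome's frequency is proportional to its fitness times its previous frequency, and `q` still sums to one.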

The population-level effect of variation is modeled by the variation operator. This 
operator is parameterized by a transmission function which models the effect of variation 
at the individual level. 

Definition 4 (Variation Operator$^2$) Let $X$ be a countable set, and for any $m \in \mathbb{N}^+$, let
$T \in \Lambda^X_m$ be a transmission function over $X$. We define the variation operator
$\mathcal{V}_T : \Lambda^X \to \Lambda^X$ as follows:

$$(\mathcal{V}_T\, p)(x) = \sum_{(x_1, \ldots, x_m) \in \prod_1^m X} T(x \mid x_1, \ldots, x_m) \prod_{i=1}^{m} p(x_i)$$

2. Also called the Mixing Operator in (Vose, 1999) and (Toussaint, 2003a).
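The variation operator can be sketched in the same array representation (an assumption made for illustration: `T[x, x1, ..., xm]` holds $T(x \mid x_1, \ldots, x_m)$ and `p` is a length-$N$ array). The name `variation` and the one-locus crossover `C` are hypothetical.

```python
import itertools
import numpy as np

def variation(T, p):
    """(V_T p)(x): distribution of children when m parents are drawn i.i.d. from p."""
    N, m = T.shape[0], T.ndim - 1
    out = np.zeros(N)
    for parents in itertools.product(range(N), repeat=m):
        prob_parents = np.prod(p[list(parents)])       # prod_i p(x_i)
        out += T[(slice(None),) + parents] * prob_parents
    return out

# One-locus crossover: the child copies either parent with probability 1/2.
C = np.zeros((2, 2, 2))
for y, z in itertools.product(range(2), repeat=2):
    C[y, y, z] += 0.5
    C[z, y, z] += 0.5

p = np.array([0.3, 0.7])
child = variation(C, p)
```

With a single locus, crossover cannot change allele frequencies, so `child` equals `p` in this example.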



The next definition describes the projection operator (previously used in (Vose, 1999)
and (Toussaint, 2003a)). A projection operator that is parameterized by some function $\beta$
'projects' distributions over the domain of $\beta$, to distributions over its co-domain.

Definition 5 (Projection Operator) Let $X$ be a countable set, let $Y$ be some set, and
let $\beta : X \to Y$ be a function. We define the projection operator, $\Xi_\beta : \Lambda^X \to \Lambda^Y$ as follows:

$$(\Xi_\beta\, p)(y) = \sum_{x \in \langle y \rangle_\beta} p(x)$$

and call $\Xi_\beta\, p$ the $\beta$-projection of $p$.
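A short sketch of the projection operator follows, assuming the function $\beta$ is stored as an integer array `beta` with `beta[x]` giving the image of genome `x`. The name `project` and the example values are illustrative.

```python
import numpy as np

def project(beta, p, num_themes):
    """(Xi_beta p)(y): total probability mass of the pre-image of y under beta."""
    out = np.zeros(num_themes)
    np.add.at(out, beta, p)          # out[beta[x]] += p[x] for every x
    return out

beta = np.array([0, 0, 1, 1])        # genomes 0, 1 map to 0; genomes 2, 3 map to 1
p = np.array([0.1, 0.2, 0.3, 0.4])
q = project(beta, p, 2)
```

The projected distribution `q` is a distribution over the co-domain: it assigns each image the summed mass of its pre-image.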

3. Formalization of a Class of Coarse-Grainings 

The following definition introduces some convenient function-related terminology. 

Definition 6 (Theme map, Theme Set, Themes, Theme Class) Let $X$, $K$ be sets and
let $\beta : X \to K$ be a surjective function. We call $\beta$ a theme map, call the co-domain $K$ of
$\beta$ the theme set of $\beta$, call any element in $K$ a theme of $\beta$, and call the pre-image
$\langle k \rangle_\beta$ of some $k \in K$, the theme class of $k$ under $\beta$.

The next definition formalizes a class of coarse-grainings in which the macroscopic 
and microscopic state variables always sum to 1. 

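Definition 7 (Semi-Coarsenablity, Coarsenablity, Global Coarsenablity) Let
$G$, $K$ be sets, let $\mathcal{W} : \Lambda^G \to \Lambda^G$ be an operator, let $\beta : G \to K$ be a theme map, and let
$U \subseteq \Lambda^G$ such that $\Xi_\beta(U) = \Lambda^K$. We say that $\mathcal{W}$ is semi-coarsenable under $\beta$ on $U$ if there
exists an operator $\mathcal{Q} : \Lambda^K \to \Lambda^K$ such that for all $p \in U$,
$\mathcal{Q} \circ \Xi_\beta\, p = \Xi_\beta \circ \mathcal{W}\, p$, i.e. the following diagram commutes:

$$\begin{array}{ccc}
U & \xrightarrow{\ \mathcal{W}\ } & \Lambda^G \\
\downarrow {\scriptstyle \Xi_\beta} & & \downarrow {\scriptstyle \Xi_\beta} \\
\Lambda^K & \xrightarrow{\ \mathcal{Q}\ } & \Lambda^K
\end{array}$$

Since $\beta$ is surjective, if $\mathcal{Q}$ exists, it is clearly unique; we call it the quotient. We call
$G$, $K$, $\mathcal{W}$, and $U$ the domain, co-domain, primary operator and turf respectively. If in
addition $\mathcal{W}(U) \subseteq U$ we say that $\mathcal{W}$ is coarsenable under $\beta$ on $U$. If in addition $U = \Lambda^G$ we say
that $\mathcal{W}$ is globally coarsenable under $\beta$.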

Note that the partition function of the coarse-graining is not the same as the theme map
$\beta$ of the coarsening.

Global coarsenablity is a stricter condition than coarsenablity, which in turn is a
stricter condition than semi-coarsenablity. It is easily shown that global coarsenablity is
equivalent to Vose's notion of compatibility (Vose, 1999, p. 188) (for a proof see Theorem
17.5 in Vose, 1999).

If some operator $\mathcal{W}$ is coarsenable under some theme map $\beta$ on some turf $U$ with
some quotient $\mathcal{Q}$, then for any distribution $p_K \in \Xi_\beta(U)$, and all distributions
$p_G \in \langle p_K \rangle_{\Xi_\beta}$, one can study the projected effect of the repeated application of
$\mathcal{W}$ to $p_G$ simply by studying the effect of the repeated application of $\mathcal{Q}$ to $p_K$. If the
size of $K$ is small then a computational study of the projected effect of the repeated
application of $\mathcal{W}$ to distributions in $U$ becomes feasible.



4. Global Coarsenablity of Variation 

We show that some variation operator $\mathcal{V}_T$ is globally coarsenable under some theme map if
a relationship, that we call ambivalence, exists between the transmission function $T$ of the
variation operator and the theme map.

To illustrate the idea of ambivalence consider a theme map $\beta$ which partitions a
genome set $G$ into three subsets. Fig 1 depicts the behavior of a two-parent transmission
function that is ambivalent under $\beta$. Given two parents and some child, the probability that
the child will belong to some theme class depends only on the theme classes of the parents
and not on the specific parent genomes. Hence the name 'ambivalent': it captures the
sense that when viewed from the coarse-grained level of the theme classes, a transmission
function 'does not care' about the specific genomes of the parents or the child.

The definition of ambivalence that follows is equivalent to but more useful than the
definition given by Burjorjee and Pollack (2006).



Definition 8 (Ambivalence) Let $G$, $K$ be countable sets, let $T \in \Lambda^G_m$ be a transmission
function, and let $\beta : G \to K$ be a theme map. We say that $T$ is ambivalent under $\beta$ if there
exists some transmission function $Y \in \Lambda^K_m$, such that for all $k, k_1, \ldots, k_m \in K$ and for any
$x_1 \in \langle k_1 \rangle_\beta, \ldots, x_m \in \langle k_m \rangle_\beta$,

$$\sum_{x \in \langle k \rangle_\beta} T(x \mid x_1, \ldots, x_m) = Y(k \mid k_1, \ldots, k_m)$$

If such a $Y$ exists, it is clearly unique. We denote it by $T^{\vec{\beta}}$ and call it the theme
transmission function.
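Ambivalence can be checked by brute force on small finite genome sets. The sketch below assumes the array representation `T[x, x1, ..., xm]` and an integer array `beta` of theme labels; the function name `theme_transmission` and both example transmission functions are illustrative assumptions. It returns the theme transmission function when one exists and `None` otherwise.

```python
import itertools
import numpy as np

def theme_transmission(T, beta, num_themes):
    """Return the theme transmission function of T under beta, or None.

    T[x, x1, ..., xm] = T(x | x1, ..., xm); beta[x] is the theme of genome x.
    None is returned when T is not ambivalent under beta.
    """
    N, m = T.shape[0], T.ndim - 1
    # S[k, x1, ..., xm] = sum of T(x | x1, ..., xm) over children x in theme class k
    S = np.zeros((num_themes,) + T.shape[1:])
    np.add.at(S, beta, T)
    Y = np.full((num_themes,) * (m + 1), np.nan)
    for parents in itertools.product(range(N), repeat=m):
        themes = tuple(int(beta[x]) for x in parents)
        column = S[(slice(None),) + parents]
        known = Y[(slice(None),) + themes]
        if np.isnan(known).all():
            Y[(slice(None),) + themes] = column   # first parents seen with these themes
        elif not np.allclose(known, column):
            return None                           # result depends on the specific parents
    return Y

# Uniform crossover on 2-bit genomes (coded 0..3); theme = value of the first bit.
N = 4
T = np.zeros((N, N, N))
for y, z, mask in itertools.product(range(N), repeat=3):
    T[(y & ~mask) | (z & mask), y, z] += 1.0 / N
beta = np.array([0, 0, 1, 1])
Y = theme_transmission(T, beta, 2)

# A one-parent operation that is not ambivalent: genome 1 jumps to theme 1, genome 0 stays.
D = np.zeros((N, N))
D[0, 0] = D[2, 1] = D[2, 2] = D[3, 3] = 1.0
not_ambivalent = theme_transmission(D, beta, 2)
```

In this example `Y` is one-locus crossover on the first bit, while `D` fails the test because genomes 0 and 1 share a theme but send their children to different theme classes.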

Suppose $T \in \Lambda^X_m$ is ambivalent under some $\beta : X \to K$. We can use the projection
operator to express the projection of $T$ under $\beta$ as follows: for all $k, k_1, \ldots, k_m \in K$, and
any $x_1 \in \langle k_1 \rangle_\beta, \ldots, x_m \in \langle k_m \rangle_\beta$, $T^{\vec{\beta}}(k \mid k_1, \ldots, k_m)$ is given by
$(\Xi_\beta(T(\cdot \mid x_1, \ldots, x_m)))(k)$.
The notion of ambivalence is equivalent to a generalization of Toussaint's notion of trivial
neutrality (Toussaint, 2003a, p. 26). A one-parent transmission function is ambivalent
under a mapping to the set of phenotypes if and only if it is trivially neutral.

The following theorem shows that a variation operator is globally coarsenable under
some theme map if it is parameterized by a transmission function which is ambivalent under
that theme map. The method by which we prove this theorem extends the method used by
Toussaint (2003a) in his proof of Theorem 1.2.2.

Figure 1: Let $\beta : G \to K$ be a coarse-graining which partitions the genome set $G$ into three
theme classes. This figure depicts the behavior of a two-parent variation operator
that is ambivalent under $\beta$. The small dots denote specific genomes and the solid
unlabeled arrows denote the recombination of these genomes. A dashed arrow
denotes that a child from a recombination may be produced 'somewhere' within
the theme class that it points to, and the label of a dashed arrow denotes the
probability with which this might occur. As the diagram shows the probability
that the child of a variation operation will belong to a particular theme class
depends only on the theme classes of the parents and not on their specific genomes.



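Theorem 9 (Global Coarsenablity of Variation) Let $G$ and $K$ be countable sets, let
$T \in \Lambda^G_m$ be a transmission function and let $\beta : G \to K$ be some theme map such that $T$
is ambivalent under $\beta$. Then $\mathcal{V}_T : \Lambda^G \to \Lambda^G$ is globally coarsenable under $\beta$ with quotient
$\mathcal{V}_{T^{\vec{\beta}}}$, i.e. the following diagram commutes:

$$\begin{array}{ccc}
\Lambda^G & \xrightarrow{\ \mathcal{V}_T\ } & \Lambda^G \\
\downarrow {\scriptstyle \Xi_\beta} & & \downarrow {\scriptstyle \Xi_\beta} \\
\Lambda^K & \xrightarrow{\ \mathcal{V}_{T^{\vec{\beta}}}\ } & \Lambda^K
\end{array}$$

Proof: For any $p \in \Lambda^G$,

$$\begin{aligned}
(\Xi_\beta \circ \mathcal{V}_T\, p)(k)
&= \sum_{x \in \langle k \rangle_\beta} \; \sum_{(x_1, \ldots, x_m) \in \prod_1^m G} T(x \mid x_1, \ldots, x_m) \prod_{i=1}^{m} p(x_i) \\
&= \sum_{(x_1, \ldots, x_m) \in \prod_1^m G} \; \sum_{x \in \langle k \rangle_\beta} T(x \mid x_1, \ldots, x_m) \prod_{i=1}^{m} p(x_i) \\
&= \sum_{(x_1, \ldots, x_m) \in \prod_1^m G} \; \prod_{i=1}^{m} p(x_i) \sum_{x \in \langle k \rangle_\beta} T(x \mid x_1, \ldots, x_m) \\
&= \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} \; \sum_{(x_1, \ldots, x_m) \in \prod_{j=1}^m \langle k_j \rangle_\beta} \; \prod_{i=1}^{m} p(x_i) \sum_{x \in \langle k \rangle_\beta} T(x \mid x_1, \ldots, x_m) \\
&= \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} \; \sum_{(x_1, \ldots, x_m) \in \prod_{j=1}^m \langle k_j \rangle_\beta} \; \prod_{i=1}^{m} p(x_i) \; T^{\vec{\beta}}(k \mid k_1, \ldots, k_m) \\
&= \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} T^{\vec{\beta}}(k \mid k_1, \ldots, k_m) \sum_{(x_1, \ldots, x_m) \in \prod_{j=1}^m \langle k_j \rangle_\beta} \; \prod_{i=1}^{m} p(x_i) \\
&= \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} T^{\vec{\beta}}(k \mid k_1, \ldots, k_m) \Big( \sum_{x_1 \in \langle k_1 \rangle_\beta} p(x_1) \Big) \cdots \Big( \sum_{x_m \in \langle k_m \rangle_\beta} p(x_m) \Big) \\
&= \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} T^{\vec{\beta}}(k \mid k_1, \ldots, k_m) \prod_{i=1}^{m} (\Xi_\beta\, p)(k_i) \\
&= (\mathcal{V}_{T^{\vec{\beta}}} \circ \Xi_\beta\, p)(k) \qquad \blacksquare
\end{aligned}$$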
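The commuting diagram of Theorem 9 can be checked numerically on a small example. The sketch below uses uniform crossover on 3-bit genomes and a theme map that keeps the first and last bit. It assumes that the corresponding theme transmission function is uniform crossover on 2-bit strings, which is consistent with the treatment of mask based crossover in section 8. All function names are illustrative.

```python
import itertools
import numpy as np

def uniform_crossover(n_bits):
    """T[x, y, z]: uniform crossover on n_bits-bit genomes coded as integers."""
    N = 2 ** n_bits
    T = np.zeros((N, N, N))
    for y, z, mask in itertools.product(range(N), repeat=3):
        T[(y & ~mask) | (z & mask), y, z] += 1.0 / N
    return T

def variation(T, p):
    """(V_T p)(x) for a two-parent transmission function."""
    return np.einsum("xyz,y,z->x", T, p, p)

def project(beta, p, num_themes):
    """(Xi_beta p)(k): mass of each theme class."""
    out = np.zeros(num_themes)
    np.add.at(out, beta, p)
    return out

# Theme map: keep the first and last bit of a 3-bit genome.
beta = np.array([((x >> 2) & 1) * 2 + (x & 1) for x in range(8)])

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(8))        # an arbitrary population over the 8 genomes

lhs = project(beta, variation(uniform_crossover(3), p), 4)   # vary, then project
rhs = variation(uniform_crossover(2), project(beta, p, 4))   # project, then vary
```

Because the theorem asserts global coarsenablity, `lhs` and `rhs` should agree (up to floating point error) for any choice of `p`, not only for the random population drawn here.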



The so called implicit parallelism theorem (Wright et al., 2003) is similar to the
theorem above. Note however that the former theorem only shows that variation is globally
coarsenable if firstly, the genome set consists of "fixed length strings, where the size of the
alphabet can vary from position to position", secondly the partition over the genome set is a
schema partition, and thirdly variation is 'structural' (see Wright et al., 2003; Rowe et al.,
2004 for details). The global coarsenablity of variation theorem has none of these
specific requirements. Instead it is premised on the existence of an abstract relationship,
ambivalence, between the variation operation and a theme map. The abstract nature of
this relationship makes this theorem applicable to evolutionary algorithms other than GAs.
In addition this theorem illuminates the essential relationship between 'structural' variation
and schemata which was used (implicitly) in the proof of the implicit parallelism theorem
of Wright et al. (2003).

In sections 7 and 8 we show that a transmission function that models any combination
of variation operations that are commonly used in GAs (i.e. any combination of mask
based crossover and canonical mutation, in any order) is ambivalent under any schema
map (defined in definition 26). Therefore any combination of common variation operators
with canonical mutation is globally coarsenable under any schema map. This is equivalent
to the result of the implicit parallelism theorem.

5. Limitwise Semi-Coarsenablity of Selection 

For some fitness function $f : G \to \mathbb{R}^+$ and some theme map $\beta : G \to K$ let us say that $f$
is thematically invariant under $\beta$ if, for any theme $k \in K$, the genomes that belong
to $\langle k \rangle_\beta$ all have the same fitness. Paraphrasing the comments of Wright et al. (2003) using
the terminology developed in this paper, Wright et al. argue that if the selection operator
is globally coarsenable under some schema map $\beta : G \to K$ then the fitness function that
parameterizes the selection operator is 'schematically' invariant under $\beta$. It is relatively
simple to use contradiction to prove a generalization of this statement for arbitrary theme
maps.

Schematic invariance is a very strict condition for a fitness function. An IPSGA whose 
fitness function meets this condition is unlikely to yield any substantive information about 
the dynamics of real world GAs. 

As stated above, the selection operator is not globally coarsenable unless the fitness
function satisfies thematic invariance. However, if the set of distributions that selection
operates over (i.e. the turf) is appropriately constrained, then, as we show in this section,
the selection operator is semi-coarsenable over the turf even when the fitness function only
satisfies a much weaker condition called thematic mean invariance.

For any theme map $\beta : G \to K$, any theme $k$, and any distribution $p \in \Lambda^G$, the
theme conditional operator, defined below, returns a conditional distribution in $\Lambda^G$ that is
obtained by normalizing the probability mass of the elements in $\langle k \rangle_\beta$ by $(\Xi_\beta\, p)(k)$.

Definition 10 (Theme Conditional Operator) Let $G$ be some countable set, let $K$ be
some set, and let $\beta : G \to K$ be some function. We define the theme conditional operator
$\mathcal{C}_\beta : \Lambda^G \times K \to \Lambda^G \cup 0^G$ as follows: For any $p \in \Lambda^G$, and any $k \in K$,
$\mathcal{C}_\beta(p, k) \in \Lambda^G \cup 0^G$ such that for any $x \in G$,

$$(\mathcal{C}_\beta(p, k))(x) = \begin{cases} 0 & \text{if } \beta(x) \neq k \text{ or } (\Xi_\beta\, p)(k) = 0 \\[4pt] \dfrac{p(x)}{(\Xi_\beta\, p)(k)} & \text{otherwise} \end{cases}$$

A useful property of the theme conditional operator is that it can be composed with
the expected fitness operator to give an operator that returns the average fitness of the
genomes in some theme class. To be precise, given some finite genome set $G$, some theme
map $\beta : G \to K$, some fitness function $f : G \to \mathbb{R}^+$, some distribution $p \in \Lambda^G$, and some
theme $k \in K$, $\mathcal{E}_f \circ \mathcal{C}_\beta(p, k)$ is the average fitness of the genomes in $\langle k \rangle_\beta$. This property
proves useful in the following definition.
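The theme conditional operator and its composition with the expectation operator can be sketched as follows, assuming genomes indexed $0, \ldots, N-1$ and a theme map stored as an integer array `beta`. The name `theme_conditional` and the example values are illustrative.

```python
import numpy as np

def theme_conditional(beta, p, k):
    """C_beta(p, k): p restricted to the theme class of k and renormalized."""
    in_class = (beta == k)
    mass = p[in_class].sum()            # (Xi_beta p)(k)
    if mass == 0:
        return np.zeros_like(p)         # the zero function 0^G
    return np.where(in_class, p / mass, 0.0)

beta = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.3, 0.6, 0.0])
f = np.array([1.0, 2.0, 3.0, 4.0])

cond = theme_conditional(beta, p, 0)
avg0 = float(np.dot(f, cond))           # E_f o C_beta(p, 0)
```

Here `avg0` is the population-weighted average fitness of the genomes in the theme class of theme 0.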

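Definition 11 (Bounded Thematic Mean Divergence, Thematic Mean Invariance)
Let $G$ be some finite set, let $K$ be some set, let $\beta : G \to K$ be a theme map, let $f : G \to \mathbb{R}^+$
and $f^* : K \to \mathbb{R}^+$ be functions, let $U \subseteq \Lambda^G$, and let $\delta \in \mathbb{R}^+_0$. We say that the thematic
mean divergence of $f$ with respect to $f^*$ on $U$ under $\beta$ is bounded by $\delta$ if, for any $p \in U$
and for any $k \in K$

$$|\mathcal{E}_f \circ \mathcal{C}_\beta(p, k) - f^*(k)| \leq \delta$$

If $\delta = 0$ we say that $f$ is thematically mean invariant with respect to $f^*$ on $U$.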

The next definition gives us a means to measure a 'distance' between real valued 
functions over finite sets. 

Definition 12 (Manhattan Distance Between Real Valued Functions) Let $X$ be a
finite set then for any functions $f$, $h$ of type $X \to \mathbb{R}$ we define the manhattan distance
between $f$ and $h$, denoted by $d(f, h)$, as follows:

$$d(f, h) = \sum_{x \in X} |f(x) - h(x)|$$

It is easily checked that d is a metric. 

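Let $f : G \to \mathbb{R}^+$, $\beta : G \to K$ and $f^* : K \to \mathbb{R}^+$ be functions with finite domains,
and let $U \subseteq \Lambda^G$. The following theorem shows that if the thematic mean divergence of $f$
with respect to $f^*$ on $U$ under $\beta$ is bounded by some $\delta$, then in the limit as $\delta \to 0$, $\mathcal{S}_f$ is
semi-coarsenable under $\beta$ on $U$.

Theorem 13 (Limitwise Semi-Coarsenablity of Selection) Let $G$ and $K$ be finite
sets, let $\beta : G \to K$ be a theme map. Let $U \subseteq \Lambda^G$ such that $\Xi_\beta(U) = \Lambda^K$, let $f : G \to \mathbb{R}^+$,
$f^* : K \to \mathbb{R}^+$ be some functions such that the thematic mean divergence of $f$ with respect
to $f^*$ on $U$ under $\beta$ is bounded by some $\delta > 0$, then for any $p \in U$,

$$\lim_{\delta \to 0} d(\Xi_\beta \circ \mathcal{S}_f\, p,\ \mathcal{S}_{f^*} \circ \Xi_\beta\, p) = 0$$

We depict the result of this theorem as follows:

$$\begin{array}{ccc}
U & \xrightarrow{\ \mathcal{S}_f\ } & \Lambda^G \\
\downarrow {\scriptstyle \Xi_\beta} & \lim_{\delta \to 0} & \downarrow {\scriptstyle \Xi_\beta} \\
\Lambda^K & \xrightarrow{\ \mathcal{S}_{f^*}\ } & \Lambda^K
\end{array}$$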

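Proof: Let $\epsilon > 0$. We prove that there exists a $\delta' > 0$ such that

$$\delta < \delta' \Rightarrow d(\Xi_\beta \circ \mathcal{S}_f\, p,\ \mathcal{S}_{f^*} \circ \Xi_\beta\, p) < \epsilon$$

For any $k \in K$,

$$\begin{aligned}
(\Xi_\beta \circ \mathcal{S}_f\, p)(k)
&= \sum_{g \in \langle k \rangle_\beta} (\mathcal{S}_f\, p)(g) \\
&= \frac{\sum_{g \in \langle k \rangle_\beta} f(g)\, p(g)}{\sum_{g' \in G} f(g')\, p(g')} \\
&= \frac{\sum_{g \in \langle k \rangle_\beta} f(g)\, (\Xi_\beta\, p)(k)\, (\mathcal{C}_\beta(p, k))(g)}{\sum_{k' \in K} \sum_{g' \in \langle k' \rangle_\beta} f(g')\, (\Xi_\beta\, p)(k')\, (\mathcal{C}_\beta(p, k'))(g')} \\
&= \frac{(\Xi_\beta\, p)(k) \sum_{g \in \langle k \rangle_\beta} f(g)\, (\mathcal{C}_\beta(p, k))(g)}{\sum_{k' \in K} (\Xi_\beta\, p)(k') \sum_{g' \in \langle k' \rangle_\beta} f(g')\, (\mathcal{C}_\beta(p, k'))(g')} \\
&= \frac{(\Xi_\beta\, p)(k)\; \mathcal{E}_f \circ \mathcal{C}_\beta(p, k)}{\sum_{k' \in K} (\Xi_\beta\, p)(k')\; \mathcal{E}_f \circ \mathcal{C}_\beta(p, k')} \\
&= (\mathcal{S}_{\mathcal{E}_f \circ \mathcal{C}_\beta(p, \cdot)} \circ \Xi_\beta\, p)(k)
\end{aligned}$$

So we have that

$$d(\Xi_\beta \circ \mathcal{S}_f\, p,\ \mathcal{S}_{f^*} \circ \Xi_\beta\, p) = d(\mathcal{S}_{\mathcal{E}_f \circ \mathcal{C}_\beta(p, \cdot)} \circ \Xi_\beta\, p,\ \mathcal{S}_{f^*} \circ \Xi_\beta\, p)$$

By Lemma 30 (in the appendix) there exists a $\delta_1 > 0$ such that,

$$d(\mathcal{E}_f \circ \mathcal{C}_\beta(p, \cdot),\ f^*) < \delta_1 \Rightarrow d(\mathcal{S}_{\mathcal{E}_f \circ \mathcal{C}_\beta(p, \cdot)}(\Xi_\beta\, p),\ \mathcal{S}_{f^*}(\Xi_\beta\, p)) < \epsilon$$

Let $\delta' = \delta_1 / |K|$. Now, if $\delta < \delta'$, then $d(\mathcal{E}_f \circ \mathcal{C}_\beta(p, \cdot),\ f^*) < \delta_1$, so
$d(\Xi_\beta \circ \mathcal{S}_f\, p,\ \mathcal{S}_{f^*} \circ \Xi_\beta\, p) < \epsilon$ $\blacksquare$

The proof of the following theorem should be obvious from the proof of the preceding one.
We therefore omit it.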

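Theorem 14 Let $G$ and $K$ be finite sets, let $\beta : G \to K$ be a theme map, let
$f : G \to \mathbb{R}^+$, $f^* : K \to \mathbb{R}^+$ be functions such that for any $k \in K$, and for any
$g \in \langle k \rangle_\beta$, $f(g) = f^*(k)$. Then $\mathcal{S}_f$ is globally coarsenable under $\beta$ with quotient
$\mathcal{S}_{f^*}$, i.e. the following diagram commutes:

$$\begin{array}{ccc}
\Lambda^G & \xrightarrow{\ \mathcal{S}_f\ } & \Lambda^G \\
\downarrow {\scriptstyle \Xi_\beta} & & \downarrow {\scriptstyle \Xi_\beta} \\
\Lambda^K & \xrightarrow{\ \mathcal{S}_{f^*}\ } & \Lambda^K
\end{array}$$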
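The two selection results can be illustrated numerically. In the sketch below the fitness of each genome is the fitness of its theme plus a perturbation of size at most `delta`, so the thematic mean divergence is bounded by `delta`. The specific arrays, the perturbation `noise` and the names `gap`, `select`, `project` and `manhattan` are assumptions made for illustration.

```python
import numpy as np

def select(f, p):
    """S_f p: fitness proportional selection."""
    return f * p / np.dot(f, p)

def project(beta, p, num_themes):
    """(Xi_beta p)(k): mass of each theme class."""
    out = np.zeros(num_themes)
    np.add.at(out, beta, p)
    return out

def manhattan(f, h):
    """d(f, h): Manhattan distance between two functions on a finite set."""
    return float(np.abs(f - h).sum())

beta = np.array([0, 0, 1, 1])                 # two theme classes of two genomes each
f_star = np.array([1.0, 2.0])                 # fitness of each theme
noise = np.array([1.0, -1.0, 1.0, 1.0])       # per-genome perturbation, |noise| <= 1
p = np.array([0.1, 0.4, 0.3, 0.2])            # some population

def gap(delta):
    """d(Xi_beta o S_f p, S_f* o Xi_beta p) when f deviates from f* by at most delta."""
    f = f_star[beta] + delta * noise
    return manhattan(project(beta, select(f, p), 2),
                     select(f_star, project(beta, p, 2)))

gaps = [gap(delta) for delta in (0.5, 0.05, 0.005, 0.0)]
```

The gap shrinks as `delta` shrinks and vanishes when `delta` is zero, which is the thematically invariant case covered by Theorem 14.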




6. Limitwise Coarsenablity of Evolution 

The two definitions below formalize the idea of an infinite population model of an EA, and
its dynamics$^3$.

3. The definition of an Evolution Machine given here is different from that given by Burjorjee and Pollack
(2005a,b). The fitness function in this definition maps genomes directly to fitness values. It therefore
subsumes the genome-to-phenotype and the phenotype-to-fitness functions of the previous definition. In
previous work these two functions were always composed together; their subsumption within a single
function increases clarity.

Definition 15 (Evolution Machine) An evolution machine (EM) is a tuple $(G, T, f)$
where $G$ is some set called the domain, $f : G \to \mathbb{R}^+$ is a function called the fitness function
and $T \in \Lambda^G_m$ is called the transmission function.

Definition 16 (Evolution Epoch Operator) Let $E = (G, T, f)$ be an evolution machine.
We define the evolution epoch operator $\mathcal{G}_E : \Lambda^G \to \Lambda^G$ as follows:

$$\mathcal{G}_E = \mathcal{V}_T \circ \mathcal{S}_f$$
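A minimal sketch of the evolution epoch operator follows, assuming a two-parent transmission function stored as an array `T[x, y, z]` and fitness and population stored as arrays. The names `epoch` and `evolve`, and the one-locus example, are illustrative assumptions.

```python
import itertools
import numpy as np

def select(f, p):
    """S_f p: fitness proportional selection."""
    return f * p / np.dot(f, p)

def variation(T, p):
    """(V_T p)(x) for a two-parent transmission function T[x, y, z]."""
    return np.einsum("xyz,y,z->x", T, p, p)

def epoch(T, f, p):
    """G_E p = V_T(S_f p): one generation of selection followed by variation."""
    return variation(T, select(f, p))

def evolve(T, f, p, generations):
    """Apply the evolution epoch operator repeatedly."""
    for _ in range(generations):
        p = epoch(T, f, p)
    return p

# One-locus example: crossover leaves allele frequencies unchanged,
# so only selection moves the population.
C = np.zeros((2, 2, 2))
for y, z in itertools.product(range(2), repeat=2):
    C[y, y, z] += 0.5
    C[z, y, z] += 0.5
f = np.array([1.0, 2.0])
p0 = np.array([0.5, 0.5])
p1 = evolve(C, f, p0, 1)
p2 = evolve(C, f, p0, 2)
```

In this toy evolution machine the fitter allele rises from one half to two thirds after one generation and to four fifths after two.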

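The following theorem follows easily from theorems 9 and 14.

Theorem 17 Let $E = (G, T, f)$, $E^* = (K, T^*, f^*)$ be evolution machines such that $G$ and
$K$ are finite. Let $\beta : G \to K$ be a theme map. For any $k \in K$, and for any $g \in \langle k \rangle_\beta$, let
$f(g) = f^*(k)$. Let $T^* = T^{\vec{\beta}}$. Then $\mathcal{G}_E$ is globally coarsenable under $\beta$, i.e. the following
diagram commutes:

$$\begin{array}{ccc}
\Lambda^G & \xrightarrow{\ \mathcal{G}_E\ } & \Lambda^G \\
\downarrow {\scriptstyle \Xi_\beta} & & \downarrow {\scriptstyle \Xi_\beta} \\
\Lambda^K & \xrightarrow{\ \mathcal{G}_{E^*}\ } & \Lambda^K
\end{array}$$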

The condition that $f$ be constant over each theme class is too severe for this theorem to be
of any use in practice. We seek a coarsenability result where $f$ is less severely constrained.

Definition 18 (Non-Departure) Let $E = (G, T, f)$ be an evolution machine, let
$U \subseteq \Lambda^G$, and let $p \in U$. For any $\tau \in \mathbb{Z}^+$, we say that $E$ on $p$ is non-departing from $U$ for $\tau$
generations if for all $t \in \mathbb{Z}^+_0$ such that $t \leq \tau$,

$$\mathcal{G}^t_E(p) \in U$$

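Theorem 19 (Limitwise Coarsenablity of Evolution) Let $E = (G, T, f)$ be an
evolution machine such that $G$ is finite, let $\beta : G \to K$ be some theme map, let $f^* : K \to \mathbb{R}^+$ be
some function, let $\delta \in \mathbb{R}^+_0$, let $U \subseteq \Lambda^G$ such that $\Xi_\beta(U) = \Lambda^K$, let $p \in U$, and let $\tau \in \mathbb{Z}^+$.

Suppose that the following statements are true:

1. The thematic mean divergence of $f$ with respect to $f^*$ on $U$ under $\beta$ is bounded by $\delta$

2. $T$ is ambivalent under $\beta$

3. $E$ on $p$ is non-departing from $U$ for $\tau$ generations

Then, letting $E^* = (K, T^{\vec{\beta}}, f^*)$ be an evolution machine, for any $t \in \mathbb{Z}^+_0$ such that $t \leq \tau$,
we have that

$$\lim_{\delta \to 0} d(\Xi_\beta \circ \mathcal{G}^t_E\, p,\ \mathcal{G}^t_{E^*} \circ \Xi_\beta\, p) = 0$$

where the equation above is depicted as follows:

$$\begin{array}{ccc}
U & \xrightarrow{\ \mathcal{G}^t_E\ } & \Lambda^G \\
\downarrow {\scriptstyle \Xi_\beta} & \lim_{\delta \to 0} & \downarrow {\scriptstyle \Xi_\beta} \\
\Lambda^K & \xrightarrow{\ \mathcal{G}^t_{E^*}\ } & \Lambda^K
\end{array}$$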

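Proof: For any $t \leq \tau$ we prove that for any $\epsilon > 0$ there exists a $\delta' > 0$ such that

$$\delta < \delta' \Rightarrow d(\Xi_\beta \circ \mathcal{G}^t_E\, p,\ \mathcal{G}^t_{E^*} \circ \Xi_\beta\, p) < \epsilon$$

The proof is by induction on $t$. The base case, when $t = 0$, is trivial. For some $n \in \mathbb{Z}^+_0$,
such that $n < \tau$, let us assume the hypothesis for $t = n$. We now show that it is true for
$t = n + 1$. Note that,

$$\begin{aligned}
& d(\Xi_\beta \circ \mathcal{G}^{n+1}_E\, p,\ \mathcal{G}^{n+1}_{E^*} \circ \Xi_\beta\, p) \\
&= d(\Xi_\beta \circ \mathcal{V}_T \circ \mathcal{S}_f \circ \mathcal{G}^n_E\, p,\ \mathcal{V}_{T^{\vec{\beta}}} \circ \mathcal{S}_{f^*} \circ \mathcal{G}^n_{E^*} \circ \Xi_\beta\, p) \\
&= d(\mathcal{V}_{T^{\vec{\beta}}} \circ \Xi_\beta \circ \mathcal{S}_f \circ \mathcal{G}^n_E\, p,\ \mathcal{V}_{T^{\vec{\beta}}} \circ \mathcal{S}_{f^*} \circ \mathcal{G}^n_{E^*} \circ \Xi_\beta\, p) \quad \text{(by theorem 9)}
\end{aligned}$$

Hence, for any $\epsilon > 0$, by Lemma 28 there exists $\delta_1$ such that

$$d(\Xi_\beta \circ \mathcal{S}_f \circ \mathcal{G}^n_E\, p,\ \mathcal{S}_{f^*} \circ \mathcal{G}^n_{E^*} \circ \Xi_\beta\, p) < \delta_1 \Rightarrow d(\Xi_\beta \circ \mathcal{G}^{n+1}_E\, p,\ \mathcal{G}^{n+1}_{E^*} \circ \Xi_\beta\, p) < \epsilon$$

As $d$ is a metric it satisfies the triangle inequality. Therefore we have that

$$\begin{aligned}
d(\Xi_\beta \circ \mathcal{S}_f \circ \mathcal{G}^n_E\, p,\ \mathcal{S}_{f^*} \circ \mathcal{G}^n_{E^*} \circ \Xi_\beta\, p) \leq\ & d(\Xi_\beta \circ \mathcal{S}_f \circ \mathcal{G}^n_E\, p,\ \mathcal{S}_{f^*} \circ \Xi_\beta \circ \mathcal{G}^n_E\, p)\ + \\
& d(\mathcal{S}_{f^*} \circ \Xi_\beta \circ \mathcal{G}^n_E\, p,\ \mathcal{S}_{f^*} \circ \mathcal{G}^n_{E^*} \circ \Xi_\beta\, p)
\end{aligned}$$

By the definition of non-departure, $\mathcal{G}^n_E\, p \in U$. So, by theorem 13 there exists a $\delta_2$ such that

$$\delta < \delta_2 \Rightarrow d(\Xi_\beta \circ \mathcal{S}_f \circ \mathcal{G}^n_E\, p,\ \mathcal{S}_{f^*} \circ \Xi_\beta \circ \mathcal{G}^n_E\, p) < \frac{\delta_1}{2}$$

By lemma 29 there exists a $\delta_3$ such that

$$d(\Xi_\beta \circ \mathcal{G}^n_E\, p,\ \mathcal{G}^n_{E^*} \circ \Xi_\beta\, p) < \delta_3 \Rightarrow d(\mathcal{S}_{f^*} \circ \Xi_\beta \circ \mathcal{G}^n_E\, p,\ \mathcal{S}_{f^*} \circ \mathcal{G}^n_{E^*} \circ \Xi_\beta\, p) < \frac{\delta_1}{2}$$

By our inductive assumption, there exists a $\delta_4$ such that

$$\delta < \delta_4 \Rightarrow d(\Xi_\beta \circ \mathcal{G}^n_E\, p,\ \mathcal{G}^n_{E^*} \circ \Xi_\beta\, p) < \delta_3$$

Therefore, letting $\delta' = \min(\delta_2, \delta_4)$ we get that

$$\delta < \delta' \Rightarrow d(\Xi_\beta \circ \mathcal{G}^{n+1}_E\, p,\ \mathcal{G}^{n+1}_{E^*} \circ \Xi_\beta\, p) < \epsilon \qquad \blacksquare$$

The limitwise coarsenability of evolution theorem is very general. As we have not
committed ourselves to any particular genomic data-structure the coarse-graining result we
have obtained is applicable to any IPEA provided that it satisfies three abstract
conditions: bounded thematic mean divergence, ambivalence, and non-departure. The fidelity of
the coarse-graining depends on the minimal bound on the thematic mean divergence.
Maximum fidelity is achieved in the limit as this minimal bound tends to zero.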

Note that apart from the way it is parameterized, the quotient operator is the same as 
the primary operator. We therefore say that the coarse-graining is operationally invariant. 
This feature is very significant. It means that the frequency dynamics of the themes of 
an IPEA which satisfies the three abstract conditions mentioned above, can be accurately 
predicted by the frequency dynamics of the genomes of another IPEA, for some number 
of generations, provided that the minimal bound of the thematic mean divergence of the 
former IPEA is sufficiently low. 
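The operational invariance described above can be illustrated with a small numerical sketch. It assumes 4-bit genomes under uniform crossover, a theme map that keeps the first and last locus, a theme fitness `f_star`, and a genome fitness that deviates from it by at most `delta` (through a term depending on another locus). The quotient dynamics are those of an evolution machine over 2-bit genomes. These modeling choices and all names are illustrative assumptions rather than the experiments of this paper.

```python
import itertools
import numpy as np

def uniform_crossover(n_bits):
    """T[x, y, z]: uniform crossover on n_bits-bit genomes coded as integers."""
    N = 2 ** n_bits
    T = np.zeros((N, N, N))
    for y, z, mask in itertools.product(range(N), repeat=3):
        T[(y & ~mask) | (z & mask), y, z] += 1.0 / N
    return T

def project(beta, p, num_themes):
    out = np.zeros(num_themes)
    np.add.at(out, beta, p)
    return out

def epoch(T, f, p):
    """One generation: fitness proportional selection, then two-parent variation."""
    s = f * p / np.dot(f, p)
    return np.einsum("xyz,y,z->x", T, s, s)

n_bits = 4
genomes = np.arange(2 ** n_bits)
beta = ((genomes >> 3) & 1) * 2 + (genomes & 1)   # schema map on first and last locus
f_star = np.array([1.0, 2.0, 3.0, 4.0])           # fitness of each theme
noise = ((genomes >> 1) & 1).astype(float)        # depends on another locus, in [0, 1]
T, T_theme = uniform_crossover(n_bits), uniform_crossover(2)

def gap(delta, generations=3):
    """Distance between projected genome dynamics and theme-level dynamics."""
    f = f_star[beta] + delta * noise
    p = np.full(2 ** n_bits, 1.0 / 2 ** n_bits)   # uniform initial population
    q = project(beta, p, 4)
    for _ in range(generations):
        p = epoch(T, f, p)                        # primary evolution machine
        q = epoch(T_theme, f_star, q)             # coarse-grained evolution machine
    return float(np.abs(project(beta, p, 4) - q).sum())
```

Calling `gap(0.0)` gives a distance of zero up to rounding (the setting of Theorem 17), and the distance is smaller for `gap(0.01)` than for `gap(0.1)`, in line with the limitwise statement of Theorem 19.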

7. The Algebra of Ambivalence 

Given several transmission functions $T_1, \ldots, T_n$ that are all ambivalent under some theme
map $\beta$ we show two ways of combining $T_1, \ldots, T_n$ to create a new transmission function that
is also ambivalent under $\beta$. Also, given a single transmission function $T$ and several theme
maps $\beta_1, \ldots, \beta_n$ such that $T$ is ambivalent under each of these theme maps, we show how,
if a certain condition is met, the theme maps can be combined to create a new theme map
such that $T$ will be ambivalent under this new theme map.

7.1 Combining Transmission functions 

Given several ambivalent transmission functions over some set, the following two lemmas 
give us two ways to create new ambivalent transmission functions — firstly by taking a 
weighted sum, and secondly by composition. 

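Lemma 20 (The Weighted Sum of Ambivalent Transmission Functions is Ambivalent)
For any set $X$, any theme map $\beta : X \to K$, and any $n \in \mathbb{N}^+$, let $T_1, \ldots, T_n \in \Lambda^X_m$
be transmission functions that are ambivalent under $\beta$. Let $p \in \Lambda^{\{1, \ldots, n\}}$ and let
$T \in \Lambda^X_m$ be defined as follows:

$$T(x \mid x_1, \ldots, x_m) = \sum_{i=1}^{n} p(i)\, T_i(x \mid x_1, \ldots, x_m)$$

Then $T$ is ambivalent under $\beta$ with $\beta$-projection given as follows:

$$T^{\vec{\beta}}(k \mid k_1, \ldots, k_m) = \sum_{i=1}^{n} p(i)\, T_i^{\vec{\beta}}(k \mid k_1, \ldots, k_m)$$

Proof: For any $x_1 \in \langle k_1 \rangle_\beta, \ldots, x_m \in \langle k_m \rangle_\beta$,

$$\begin{aligned}
\sum_{x \in \langle k \rangle_\beta} T(x \mid x_1, \ldots, x_m)
&= \sum_{x \in \langle k \rangle_\beta} \sum_{i=1}^{n} p(i)\, T_i(x \mid x_1, \ldots, x_m) \\
&= \sum_{i=1}^{n} p(i) \sum_{x \in \langle k \rangle_\beta} T_i(x \mid x_1, \ldots, x_m) \\
&= \sum_{i=1}^{n} p(i)\, T_i^{\vec{\beta}}(k \mid k_1, \ldots, k_m) \qquad \blacksquare
\end{aligned}$$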

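Lemma 21 (The Composition of Ambivalent Transmission Functions is Ambivalent)
For any $T_1 \in \Lambda^X_m$, $T_2 \in \Lambda^X_n$, if $T_1$ and $T_2$ are both ambivalent under some theme
map $\beta : X \to K$, then $T_1 \circ T_2$ is ambivalent under $\beta$ with $\beta$-projection given as follows:

$$(T_1 \circ T_2)^{\vec{\beta}}(k \mid j_1, \ldots, j_n) = \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} T_1^{\vec{\beta}}(k \mid k_1, \ldots, k_m) \prod_{i=1}^{m} T_2^{\vec{\beta}}(k_i \mid j_1, \ldots, j_n)$$

Proof: For any $y_1 \in \langle j_1 \rangle_\beta, \ldots, y_n \in \langle j_n \rangle_\beta$,

$$\begin{aligned}
& (T_1 \circ T_2)^{\vec{\beta}}(k \mid j_1, \ldots, j_n) \\
&= \sum_{x \in \langle k \rangle_\beta} \; \sum_{(x_1, \ldots, x_m) \in \prod_1^m X} T_1(x \mid x_1, \ldots, x_m) \prod_{i=1}^{m} T_2(x_i \mid y_1, \ldots, y_n) \\
&= \sum_{x \in \langle k \rangle_\beta} \; \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} \; \sum_{(x_1, \ldots, x_m) \in \prod_{j=1}^m \langle k_j \rangle_\beta} T_1(x \mid x_1, \ldots, x_m) \prod_{i=1}^{m} T_2(x_i \mid y_1, \ldots, y_n) \\
&= \sum_{x \in \langle k \rangle_\beta} \; \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} \; \sum_{(x_1, \ldots, x_m) \in \prod_{j=1}^m \langle k_j \rangle_\beta} \; \prod_{i=1}^{m} T_2(x_i \mid y_1, \ldots, y_n)\, T_1(x \mid x_1, \ldots, x_m) \\
&= \sum_{x \in \langle k \rangle_\beta} \; \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} \; \sum_{x_1 \in \langle k_1 \rangle_\beta} T_2(x_1 \mid y_1, \ldots, y_n) \cdots \sum_{x_m \in \langle k_m \rangle_\beta} T_2(x_m \mid y_1, \ldots, y_n)\, T_1(x \mid x_1, \ldots, x_m) \\
&= \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} \; \sum_{x_1 \in \langle k_1 \rangle_\beta} T_2(x_1 \mid y_1, \ldots, y_n) \cdots \sum_{x_m \in \langle k_m \rangle_\beta} T_2(x_m \mid y_1, \ldots, y_n) \sum_{x \in \langle k \rangle_\beta} T_1(x \mid x_1, \ldots, x_m) \\
&= \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} \; \sum_{x_1 \in \langle k_1 \rangle_\beta} T_2(x_1 \mid y_1, \ldots, y_n) \cdots \sum_{x_m \in \langle k_m \rangle_\beta} T_2(x_m \mid y_1, \ldots, y_n)\, T_1^{\vec{\beta}}(k \mid k_1, \ldots, k_m) \\
&= \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} T_1^{\vec{\beta}}(k \mid k_1, \ldots, k_m) \Big( \sum_{x_1 \in \langle k_1 \rangle_\beta} T_2(x_1 \mid y_1, \ldots, y_n) \Big) \cdots \Big( \sum_{x_m \in \langle k_m \rangle_\beta} T_2(x_m \mid y_1, \ldots, y_n) \Big) \\
&= \sum_{(k_1, \ldots, k_m) \in \prod_1^m K} T_1^{\vec{\beta}}(k \mid k_1, \ldots, k_m) \prod_{i=1}^{m} T_2^{\vec{\beta}}(k_i \mid j_1, \ldots, j_n) \qquad \blacksquare
\end{aligned}$$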



7.2 Combining Theme maps 

Given several theme maps with the same domain, the following definition allows us to create 
a new theme map which induces a finer partition over the domain. 

Definition 22 (Cartesian Product of Theme maps) For any $n \in \mathbb{N}^+$ and any
functions $\beta_1 : X \to K_1, \ldots, \beta_n : X \to K_n$, which share the same domain we define the cartesian
product of $\beta_1, \ldots, \beta_n$ to be the function $\beta_1 \times \ldots \times \beta_n : X \to \prod_{i=1}^{n} K_i$ as follows:

$$(\beta_1 \times \ldots \times \beta_n)(x) = (\beta_1(x), \ldots, \beta_n(x))$$

For notational convenience we will denote some cartesian product $\beta_1 \times \ldots \times \beta_n$ as
$\prod_{i=1}^{n} \beta_i$.

Given two theme maps with the same domain $\beta_1 : X \to K_1$ and $\beta_2 : X \to K_2$
and some transmission function $T \in \Lambda^X_m$, we say that $\beta_1$ and $\beta_2$ are independent with
respect to $T$ if for any choice of $m$ parents $x_1, x_2, \ldots, x_m$, the mutual information between
$\Xi_{\beta_1}(T(\cdot \mid x_1, \ldots, x_m))$ and $\Xi_{\beta_2}(T(\cdot \mid x_1, \ldots, x_m))$ is zero. In other words knowing something
about the distribution $\Xi_{\beta_1}(T(\cdot \mid x_1, \ldots, x_m))$ gives no information about the distribution
$\Xi_{\beta_2}(T(\cdot \mid x_1, \ldots, x_m))$. Formally, for all $k_1 \in K_1$, $k_2 \in K_2$

$$\Xi_{\beta_1 \times \beta_2}(T(\cdot \mid x_1, \ldots, x_m))(k_1, k_2) = \Xi_{\beta_1}(T(\cdot \mid x_1, \ldots, x_m))(k_1)\; \Xi_{\beta_2}(T(\cdot \mid x_1, \ldots, x_m))(k_2)$$

The following definition extends this idea of independence to multiple theme maps. 
Unfortunately it defines independence in a form that is more suited for use in proofs than 
for understandability. The proposition that follows shows that this form is equivalent to 
the more intuitive form that we used above. 

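Definition 23 For any set $X$ and any functions $\beta_1 : X \to K_1, \ldots, \beta_n : X \to K_n$, and any
$T \in \Lambda^X_m$, we say that $\beta_1, \ldots, \beta_n$ are independent with respect to $T$ if for all $x_1, \ldots, x_m \in X$
and all $k_1 \in K_1, \ldots, k_n \in K_n$,

$$\sum_{x \in \bigcap_{i=1}^{n} \langle k_i \rangle_{\beta_i}} T(x \mid x_1, \ldots, x_m) = \prod_{i=1}^{n} \; \sum_{x \in \langle k_i \rangle_{\beta_i}} T(x \mid x_1, \ldots, x_m)$$

Proposition 24 For any set $X$ and any functions $\beta_1 : X \to K_1, \ldots, \beta_n : X \to K_n$, and any
$T \in \Lambda^X_m$, $\beta_1, \ldots, \beta_n$ are independent with respect to $T$ if and only if for all $x_1, \ldots, x_m \in X$
and all $k_1 \in K_1, \ldots, k_n \in K_n$,

$$\Xi_{\prod_{i=1}^{n} \beta_i}(T(\cdot \mid x_1, \ldots, x_m))((k_1, \ldots, k_n)) = \Xi_{\beta_1}(T(\cdot \mid x_1, \ldots, x_m))(k_1) \cdots \Xi_{\beta_n}(T(\cdot \mid x_1, \ldots, x_m))(k_n)$$

Proof: This proposition follows directly from the definition of the projection operator (def
5) and from the observation that

$$\langle (k_1, \ldots, k_n) \rangle_{\prod_{i=1}^{n} \beta_i} = \langle k_1 \rangle_{\beta_1} \cap \ldots \cap \langle k_n \rangle_{\beta_n} \qquad \blacksquare$$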

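The next lemma shows that given some transmission function $T$ and some theme
maps $\beta_1, \ldots, \beta_n$ that are independent with respect to $T$, if $T$ is ambivalent under each of
the theme maps, then $T$ will also be ambivalent under the cartesian product of the theme
maps.

Lemma 25 For any $n \in \mathbb{N}^+$ let $\beta_1 : X \to K_1, \ldots, \beta_n : X \to K_n$ be functions which
share the same domain. For any $m \in \mathbb{N}^+$ let $T \in \Lambda^X_m$ be a transmission function such
that $\beta_1, \ldots, \beta_n$ are independent with respect to $T$ and such that for all $i \in \{1, \ldots, n\}$, $T$ is
ambivalent under $\beta_i$. Then $T$ is ambivalent under $\prod_{i=1}^{n} \beta_i$ with $\prod_{i=1}^{n} \beta_i$-projection
$T^{\overrightarrow{\prod_{i=1}^{n} \beta_i}}$ given as follows,

$$T^{\overrightarrow{\prod_{i=1}^{n} \beta_i}}((k_1, \ldots, k_n) \mid (k^1_1, \ldots, k^1_n), \ldots, (k^m_1, \ldots, k^m_n)) = T^{\vec{\beta_1}}(k_1 \mid k^1_1, \ldots, k^m_1) \cdots T^{\vec{\beta_n}}(k_n \mid k^1_n, \ldots, k^m_n)$$

Proof: For all $x_1 \in \langle (k^1_1, \ldots, k^1_n) \rangle_{\prod_{i=1}^{n} \beta_i}, \ldots, x_m \in \langle (k^m_1, \ldots, k^m_n) \rangle_{\prod_{i=1}^{n} \beta_i}$,

$$\begin{aligned}
& T^{\overrightarrow{\prod_{i=1}^{n} \beta_i}}((k_1, \ldots, k_n) \mid (k^1_1, \ldots, k^1_n), \ldots, (k^m_1, \ldots, k^m_n)) \\
&= \sum_{x \in \langle (k_1, \ldots, k_n) \rangle_{\prod_{i=1}^{n} \beta_i}} T(x \mid x_1, \ldots, x_m) \\
&= \sum_{x \in \langle k_1 \rangle_{\beta_1} \cap \ldots \cap \langle k_n \rangle_{\beta_n}} T(x \mid x_1, \ldots, x_m)
\end{aligned}$$

As $\beta_1, \ldots, \beta_n$ are independent with respect to $T$,

$$\begin{aligned}
&= \Big( \sum_{x \in \langle k_1 \rangle_{\beta_1}} T(x \mid x_1, \ldots, x_m) \Big) \cdots \Big( \sum_{x \in \langle k_n \rangle_{\beta_n}} T(x \mid x_1, \ldots, x_m) \Big) \\
&= T^{\vec{\beta_1}}(k_1 \mid k^1_1, \ldots, k^m_1) \cdots T^{\vec{\beta_n}}(k_n \mid k^1_n, \ldots, k^m_n) \qquad \blacksquare
\end{aligned}$$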



8. On the Ambivalence of Common Variational Operators Used in SGAs 

The cartesian product structure of the genome set and the nature of the variation operations commonly used in GAs make it possible to obtain ambivalence results for transmission functions that model any of the common variation operators of an SGA.

The following is some useful notation for dealing with SGAs. For any $n \in \mathbb{Z}^+$, let $\mathfrak{B}_n$ be the set of all bitstrings of length $n$. For any $x \in \mathfrak{B}_n$ and any $i \in \{1, \ldots, n\}$, let $x_i$ denote the $i^{th}$ locus of $x$.

Let $\ell \in \mathbb{N}^+$. For any $i \in \mathbb{N}^+$ such that $i \le \ell$ let $\xi_i : \mathfrak{B}_\ell \to \{0, 1\}$ be a theme map such that $\xi_i(x) = x_i$, i.e. $\xi_i$ maps a bitstring to the value of its $i^{th}$ locus.

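Definition 26 Let $I = \{j_1, \ldots, j_{|I|}\}$ be a subset of $\{1, \ldots, \ell\}$, such that $j_1 < \ldots < j_{|I|}$. Let $\xi_I : \mathfrak{B}_\ell \to \mathfrak{B}_{|I|}$ denote the theme map $\xi_{j_1} \times \ldots \times \xi_{j_{|I|}}$. We call such a theme map a schema map.

Thus a schema map maps any element of some schema $h$ to the bitstring that is obtained when all the wildcards are stripped out of the schema template of $h$.

8.1 Ambivalence of Canonical Mutation

Let us call a mutation operation that flips each bit independently with some fixed probability (written $p_m$ here) a canonical mutation and let $M \in \Lambda^{\mathfrak{B}_\ell}_1$ be a transmission function that models this operation. Observe that for any $i \in \{1, \ldots, \ell\}$, $M$ is ambivalent under $\xi_i$ with $M^{\xi_i} \in \Lambda^{\mathfrak{B}_1}_1$ given as follows:

$$M^{\xi_i}(k \mid l) = \begin{cases} 1 - p_m & \text{if } k = l \\ p_m & \text{otherwise} \end{cases}$$

Observe that for any index set $I \subseteq \{1, \ldots, \ell\}$ such that $I = \{j_1, \ldots, j_{|I|}\}$ and $j_1 < \ldots < j_{|I|}$, the theme maps $\xi_{j_1}, \ldots, \xi_{j_{|I|}}$ are independent with respect to $M$ as $M$ models a mutation operator that flips bits independently of each other. Hence, by lemma 25, $M$ is ambivalent under $\xi_I$ with $M^{\xi_I} \in \Lambda^{\mathfrak{B}_{|I|}}_1$ given by

$$M^{\xi_I}(k \mid l) = M^{\xi_{j_1}}(k_1 \mid l_1) \cdots M^{\xi_{j_{|I|}}}(k_{|I|} \mid l_{|I|}) \cdot 1$$

Multiplication by 1 at the end of the right hand side of this expression is necessary to account for the case when $I$ is the empty set.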
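The ambivalence of canonical mutation under a schema map can be checked by brute force on short genomes. The following is a minimal sketch, not the author's code: the function names and the use of `p` for the per-bit flip probability are assumptions made for illustration. It sums the mutation probability over every child in a schema class and compares the total with the same mutation applied to the projected (short) bitstrings.

```python
from itertools import product


def mutation_prob(child, parent, p):
    """Probability that independent bit-flip mutation (rate p) turns parent into child."""
    prob = 1.0
    for c, q in zip(child, parent):
        prob *= p if c != q else 1.0 - p
    return prob


def schema_map(genome, loci):
    """Read off the bits of genome at the given 0-based, increasing loci."""
    return tuple(genome[i] for i in loci)


def is_ambivalent_under_schema_map(length, loci, p, tol=1e-12):
    """Exhaustively compare projected mutation mass with mutation on short strings."""
    genomes = list(product((0, 1), repeat=length))
    for parent in genomes:
        projected_parent = schema_map(parent, loci)
        for k in product((0, 1), repeat=len(loci)):
            # Total probability of producing any child whose defining bits equal k.
            mass = sum(mutation_prob(x, parent, p)
                       for x in genomes if schema_map(x, loci) == k)
            # Candidate projection: the same per-bit mutation acting on short strings.
            if abs(mass - mutation_prob(k, projected_parent, p)) > tol:
                return False
    return True
```

Because the check enumerates all $2^\ell$ genomes it is only practical for small lengths, but it also covers the empty index set, where the product of per-locus terms reduces to 1.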

8.2 Ambivalence of Mask Based Crossover

The set of all masks for length $\ell$ bitstrings is itself the set of bitstrings $\mathfrak{B}_\ell$. For any mask $\psi \in \mathfrak{B}_\ell$ let the 2-parent transmission function $T_\psi \in \Lambda^{\mathfrak{B}_\ell}_2$ be defined as follows:

$$T_\psi(x \mid y, z) = \begin{cases} 1 & \text{if } \forall i \in \{1, \ldots, \ell\},\ (x_i = y_i \wedge \psi_i = 0) \vee (x_i = z_i \wedge \psi_i = 1) \\ 0 & \text{otherwise} \end{cases}$$

Note that for any parents $y, z$, $T_\psi(\cdot \mid y, z)$ is a discrete delta function which concentrates all its probability mass on one child. In other words, $T_\psi$ is 'deterministic'.

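For all $\psi \in \mathfrak{B}_\ell$ and for all $i \in \{1, \ldots, \ell\}$, see that $T_\psi$ is ambivalent under $\xi_i$ with $T_\psi^{\xi_i} \in \Lambda^{\mathfrak{B}_1}_2$ given as follows:

$$T_\psi^{\xi_i}(k \mid l, m) = \begin{cases} 1 & \text{if } (k = l \wedge \psi_i = 0) \vee (k = m \wedge \psi_i = 1) \\ 0 & \text{otherwise} \end{cases}$$

Observe that for any index set $I \subseteq \{1, \ldots, \ell\}$ such that $I = \{a_1, \ldots, a_{|I|}\}$ and $a_1 < \ldots < a_{|I|}$, the theme maps $\xi_{a_1}, \ldots, \xi_{a_{|I|}}$ are independent with respect to $T_\psi$ because for any two parents $y, z$ the mutual information between the distributions $\Xi_{\xi_{a_1}}(T_\psi(\cdot \mid y, z)), \ldots, \Xi_{\xi_{a_{|I|}}}(T_\psi(\cdot \mid y, z))$ is zero. (This is because the distribution $\Xi_{\xi_i}(T_\psi(\cdot \mid y, z))$ depends only on the values $y_i$, $z_i$, and $\psi_i$.) Hence, by lemma 25, $T_\psi$ is ambivalent under $\xi_I$ with $T_\psi^{\xi_I}$ given as follows:

$$T_\psi^{\xi_I}(k \mid l, m) = T_\psi^{\xi_{a_1}}(k_1 \mid l_1, m_1) \cdots T_\psi^{\xi_{a_{|I|}}}(k_{|I|} \mid l_{|I|}, m_{|I|}) \cdot 1$$

Multiplication by 1 at the end of the right hand side of this expression is once again necessary to account for the case when $I$ is the empty set.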

For some choice of distribution over the set of all masks $q \in \Lambda^{\mathfrak{B}_\ell}$, let $T \in \Lambda^{\mathfrak{B}_\ell}_2$ be given as follows:

$$T(x \mid y, z) = \sum_{\psi \in \mathfrak{B}_\ell} q(\psi)\, T_\psi(x \mid y, z)$$

As stated above, for all $\psi \in \mathfrak{B}_\ell$, $T_\psi$ is ambivalent under $\xi_I$. Therefore by lemma 20, $T$ is ambivalent under $\xi_I$ with $T^{\xi_I} \in \Lambda^{\mathfrak{B}_{|I|}}_2$ given by

$$T^{\xi_I}(k \mid l, m) = \sum_{\psi \in \mathfrak{B}_\ell} q(\psi)\, T_\psi^{\xi_I}(k \mid l, m)$$

By appropriately choosing $q$, $T$ can be made to model any $n$-point or uniform crossover operation (Wright et al., 2003), so any $n$-point or uniform crossover operation is ambivalent under $\xi_I$.

Finally, by lemma 21, any composition of $n$-point or uniform crossover operations with the canonical mutation operation, in any order, is ambivalent under $\xi_I$.
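For a fixed mask the crossover is deterministic, so its ambivalence under a schema map amounts to a commutation property: projecting the child gives the same short bitstring as crossing the projected parents with the projected mask. The sketch below illustrates this on random inputs; the function names and the 0-based locus convention are assumptions made for the example, not part of the original text.

```python
import random


def crossover_with_mask(y, z, mask):
    """Child takes its bit from y where the mask is 0 and from z where it is 1."""
    return tuple(zi if mi else yi for yi, zi, mi in zip(y, z, mask))


def schema_map(genome, loci):
    """Read off the bits of genome at the given 0-based, increasing loci."""
    return tuple(genome[i] for i in loci)


def commutes_with_schema_map(length, loci, trials, seed=0):
    """Check on random parents and masks that projection and crossover commute."""
    rng = random.Random(seed)

    def random_bits():
        return tuple(rng.randint(0, 1) for _ in range(length))

    for _ in range(trials):
        y, z, mask = random_bits(), random_bits(), random_bits()
        projected_child = schema_map(crossover_with_mask(y, z, mask), loci)
        child_of_projections = crossover_with_mask(
            schema_map(y, loci), schema_map(z, loci), schema_map(mask, loci))
        if projected_child != child_of_projections:
            return False
    return True
```

Since the marginal of a uniformly distributed mask on any subset of loci is again uniform, the same observation suggests why uniform crossover on long genomes projects to uniform crossover on the short bitstrings.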

9. Sufficient Conditions for Coarse-Graining IPSGA Dynamics

We now use the result in the previous section to argue that the dynamics of an IPSGA 
with long genomes, uniform crossover, and fitness proportional selection can be coarse- 
grained with high fidelity for a relatively coarse schema map, provided that the initial 
population satisfies a constraint called approximate schematic uniformity and the fitness 
function satisfies a constraint called low-variance schematic fitness distribution. We stress 
at the outset that our argument is principled but informal, i.e. though the argument rests 
relatively straightforwardly on theorem 3, we do find it necessary in places to appeal to 
the reader's intuitive understanding of GA dynamics. In the following section we will
experimentally validate our conclusions.

For any $n \in \mathbb{Z}^+$, let $\mathfrak{B}_n$ be the set of all bitstrings of length $n$. For some $\ell \gg 1$ and some $m \ll \ell$, let $\beta : \mathfrak{B}_\ell \to \mathfrak{B}_m$ be some schema map. Let $f^* : \mathfrak{B}_m \to \mathbb{R}^+$ be some function. For each $k \in \mathfrak{B}_m$, let $D_k \in \Lambda^{\mathbb{R}^+}$ be some distribution over the reals with low variance such that the mean of distribution $D_k$ is $f^*(k)$. Let $f : \mathfrak{B}_\ell \to \mathbb{R}^+$ be a fitness function such that for any $k \in \mathfrak{B}_m$, the fitness values of the elements of $\langle k \rangle_\beta$ are independently drawn from the distribution $D_k$. For such a fitness function we say that fitness is schematically distributed with low-variance. We call the distributions $D_k$ schema fitness distributions.

Let $U$ be a set of distributions such that for any $k \in \mathfrak{B}_m$ and any $p \in U$, $\mathcal{C}_\beta(p, k)$ is approximately uniform. It is easily checked that $U$ satisfies the condition $\Xi_\beta(U) = \Lambda^{\mathfrak{B}_m}$. We say that the distributions in $U$ are approximately schematically uniform.

Let $\delta$ be the minimal bound such that for all $p \in U$ and for all $k \in \mathfrak{B}_m$, $|\mathcal{E}_f \circ \mathcal{C}_\beta(p, k) - f^*(k)| \le \delta$. Then, for any $\epsilon > 0$, $P(\delta < \epsilon) \to 1$ as $\ell - m \to \infty$. Because we have chosen $\ell$ and $m$ such that $\ell - m$ is 'large', it is reasonable to assume that the minimal bound on the schematic mean divergence of $f$ on $U$ under $\beta$ is likely to be 'low'.

Let $T \in \Lambda^{\mathfrak{B}_\ell}_2$ be a transmission function that models the application of uniform crossover. In section 8 we proved that a transmission function that models any mask based crossover operation is ambivalent under any schema map. Uniform crossover is mask based, and $\beta$ is a schema map, therefore $T$ is ambivalent under $\beta$.

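Let $p_{\frac{1}{2}} \in \Lambda^{\mathfrak{B}_1}$ be such that $p_{\frac{1}{2}}(0) = \frac{1}{2}$ and $p_{\frac{1}{2}}(1) = \frac{1}{2}$. For any $p \in U$, $\mathcal{S}_f p$ may be 'outside' $U$ because there may be one or more $k \in \mathfrak{B}_m$ such that $\mathcal{C}_\beta(\mathcal{S}_f p, k)$ is not quite uniform. Recall that for any $k \in \mathfrak{B}_m$ the variance of $D_k$ is low. Therefore even though $\mathcal{S}_f p$ may be 'outside' $U$, the deviation from schematic uniformity is not likely to be large. Furthermore, given the low variance of $D_k$, the marginal distributions of $\mathcal{C}_\beta(\mathcal{S}_f p, k)$ will be very close to $p_{\frac{1}{2}}$. Given these facts and our choice of transmission function, for all $k \in \mathfrak{B}_m$, $\mathcal{C}_\beta(\mathcal{V}_T \circ \mathcal{S}_f p, k)$ will be more uniform than $\mathcal{C}_\beta(\mathcal{S}_f p, k)$, and we can assume that $\mathcal{V}_T \circ \mathcal{S}_f p$ is in $U$. In other words, we can assume that $E$ is non-departing over $U$.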

Let $E = (\mathfrak{B}_\ell, T, f)$ and $E^* = (\mathfrak{B}_m, T^{\beta}, f^*)$ be evolution machines. By the discussion above and the limitwise coarsenability of evolution theorem one can expect that for any approximately thematically uniform distribution $p \in U$ (including of course the uniform distribution over $\mathfrak{B}_\ell$), the dynamics of $E^*$ when initialized with $\Xi_\beta p$ will approximate the projected dynamics of $E$ when initialized with $p$. As the bound $\delta$ is 'low', the fidelity of the approximation will be 'high'.

Note that the constraint that fitness be low-variance schematically distributed, which is required for this coarse-graining, is much weaker than the very strong constraint of schematic fitness invariance (all genomes in each schema must have the same value) which Wright et al. (2003) argue is required to coarse-grain IPSGA dynamics.



10. Experimental Validation 

Let $\mathcal{F}$ be some family of schemata of order $o$ and length $\ell$, and consider the bijection from $\mathcal{F}$ to $\mathfrak{B}_o$ that maps any schema in $\mathcal{F}$ to the bitstring that is obtained when all the wildcards are removed from that schema. Consider an IPSGA with genome set $\mathfrak{B}_\ell$, fitness proportional selection, and uniform crossover. Let its fitness function be such that fitness is low-variance schematically distributed with respect to $\mathcal{F}$, with each schema in $\mathcal{F}$ mapped to its corresponding low-variance fitness distribution. Let the initial population be uniformly distributed over the genome set. We shall refer to this IPSGA as IPSGA1. A direct experimental validation of the conclusions of section 9 would show that the exact dynamics (i.e. frequencies over multiple generations) of the schemata in $\mathcal{F}$ under the action of IPSGA1 can be approximated by the coarsened dynamics of this IPSGA, where the theme map used in the coarsening is a schema map which maps the genomes in $\mathfrak{B}_\ell$ to the schemata in $\mathcal{F}$. As per the conclusions in section 9 these coarsened dynamics are given by the dynamics of an IPSGA with genome set $\mathfrak{B}_o$, fitness proportional selection, uniform crossover, uniformly distributed initial population, and a fitness function that assigns to each genome in $\mathfrak{B}_o$ the mean of the fitness distribution of its corresponding schema in $\mathcal{F}$. We shall refer to this IPSGA as IPSGA2.

As described in section 1.2, an exact calculation of the dynamics of the schemata in $\mathcal{F}$ under IPSGA1 has time complexity that is exponential in $\ell$. Therefore such a calculation quickly becomes infeasible as $\ell$ gets large. However, because the conclusions in section 9 are premised upon $o$ being much smaller than $\ell$, in order to experimentally validate these conclusions $\ell$ needs to be large.



10.1 An Indirect But Computationally Feasible Approach to Experimental 
Validation 

We resolve this dilemma by making a key assumption and a particular modeling decision. Consider an SGA with genomes of length $\ell$, fitness proportional selection, uniform crossover, a finite population of size $N \gg 2^o$ and the same fitness function as the IPSGA described above. We will assume that the dynamics of the schemata in $\mathcal{F}$ under the action of this SGA approximate the dynamics of the same family of schemata under the action of the IPSGA. Tracking these dynamics in the former case is exponential in $o$ and linear in $\ell$, and is therefore feasible when $\ell$ is large, provided that $o$ is small.

[Figure 2 diagram, four boxes: dynamics of the schemata in $\mathcal{F}$ under the action of IPSGA1; dynamics of the genomes in $\mathfrak{B}_o$ under the action of IPSGA2; dynamics of the schemata in $\mathcal{F}$ under the action of SGA; dynamics of the genomes in $\mathfrak{B}_o$ under the action of SFSGA.]

Figure 2: A diagrammatic depiction of the assumption and modeling decision which allows us to indirectly compare the actual frequency dynamics of the schemata of length $\ell$ and order $o$ in a schema family $\mathcal{F}$ under the action of IPSGA1 with the frequency dynamics of these schemata predicted by a coarsening of the genomic frequency dynamics of IPSGA1 which uses a schema map from $\mathfrak{B}_\ell$ to $\mathfrak{B}_o$ and partitions $\mathfrak{B}_\ell$ into the schemata in $\mathcal{F}$.

For genomes of any significant length, predetermining the fitness values of all possible genomes that may be generated by the SGA is computationally infeasible. So instead what if each time a genome is generated the SGA determines its fitness value "on the fly" by drawing a random sample from the distribution corresponding to the (unique) schema in $\mathcal{F}$ that the genome belongs to? There is a problem with this approach. If a genome is generated more than once during the course of a run it is unlikely to be assigned the same fitness value each time it gets generated. One way to deal with this problem is by storing all generated genomes and their fitness values and using this information to ensure that each genome is always assigned the same fitness value by the fitness function. A second, far simpler approach is based on the following observation: as $\ell$ gets larger the probability that some genome will be generated more than once during an evolutionary run gets smaller. Therefore by letting $\ell$ be "large enough" we can assume that the same genome is never generated more than once during some finite number of evolutionary runs (provided that the number of generations in each run is also finite). In other words, by letting $\ell$ be "very large" we are free to determine the fitness value of each generated genome "on the fly" and do not need to maintain a store of previously generated genomes and their fitness values.
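The "on the fly" scheme can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the author's implementation: the helper names are hypothetical, loci are 0-based, and a Gaussian is used as one convenient choice of low-variance schema fitness distribution. The fitness function looks only at the defining loci of a genome and draws a fresh sample each time it is called, so no table of previously evaluated genomes is kept.

```python
import random


def schema_map(genome, defining_loci):
    """Read off the bits of genome at the defining loci (0-based, increasing)."""
    return tuple(genome[i] for i in defining_loci)


def make_on_the_fly_fitness(schema_means, defining_loci, sigma, rng):
    """Return a fitness function that samples from the genome's schema distribution.

    schema_means maps each short bitstring (one per schema) to the mean of that
    schema's fitness distribution; sigma is the common standard deviation.
    """
    def fitness(genome):
        k = schema_map(genome, defining_loci)
        # A fresh draw on every call: repeated genomes may get different values.
        return rng.gauss(schema_means[k], sigma)

    return fitness
```

With `sigma` set to zero every genome in a schema receives exactly the schema mean, which corresponds to the stricter schematic fitness invariance setting rather than the low-variance one.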

Even though the time complexity of the SGA described above is linear in $\ell$, for large enough $\ell$ this SGA becomes infeasible to execute. So a question that seems to require a precise answer is "how large is 'very large'?" We will now argue that a precise answer to this question is not required. Let us call the SGA described in the previous paragraph SGA. Let SFSGA be an SGA with population size $N$, genome set $\mathfrak{B}_o$, fitness proportional selection, uniform crossover, and a stochastic fitness function which, for any genome $g \in \mathfrak{B}_o$, returns a sample from the fitness distribution of the schema in $\mathcal{F}$ that corresponds to $g$. Then with a little thought the reader can see that the frequency dynamics, under the action of SGA, of any schema $k \in \mathcal{F}$ can be modeled by the frequency dynamics, under the action of SFSGA, of the genome in $\mathfrak{B}_o$ that corresponds to $k$. Since we can use SFSGA to model the frequency dynamics of the schemata in $\mathcal{F}$ under the action of SGA, the answer to the question "how large does $\ell$ need to be" is simply "large enough that it is unlikely that any genome in $\mathfrak{B}_\ell$ is generated more than once during the course of an experiment".

The dynamics of IPSGA2 can feasibly be calculated when $o$ is small (regardless of the value of $\ell$). Therefore when $o$ is small it is possible to compare the coarsened dynamics of the schemata in $\mathcal{F}$ with an approximation of these dynamics obtained by using SFSGA. The diagram in Figure 2 displays the scheme described above in graphical format.
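To make concrete why the dynamics of IPSGA2 are cheap to compute for small $o$, here is a minimal sketch of an infinite-population update on $\mathfrak{B}_o$: fitness proportional selection followed by uniform crossover. The function names are hypothetical and the order of the two operators (selection, then variation) is an assumption consistent with the operators $\mathcal{V}_T \circ \mathcal{S}_f$ used in section 9. Each generation costs on the order of $o \cdot 8^o$ operations, independent of $\ell$.

```python
from itertools import product


def uniform_crossover_prob(x, y, z):
    """P(child = x | parents y, z) when each bit comes from either parent w.p. 1/2."""
    prob = 1.0
    for xi, yi, zi in zip(x, y, z):
        prob *= 0.5 * (xi == yi) + 0.5 * (xi == zi)
    return prob


def ipsga_step(p, fitness):
    """One generation: fitness proportional selection, then uniform crossover."""
    genomes = list(p)
    mean_fitness = sum(p[g] * fitness[g] for g in genomes)
    selected = {g: p[g] * fitness[g] / mean_fitness for g in genomes}
    return {
        x: sum(selected[y] * selected[z] * uniform_crossover_prob(x, y, z)
               for y in genomes for z in genomes)
        for x in genomes
    }


def ipsga_dynamics(fitness, generations):
    """Genome frequencies for generations 0..generations from a uniform start."""
    genomes = list(fitness)
    p = {g: 1.0 / len(genomes) for g in genomes}
    history = [p]
    for _ in range(generations):
        p = ipsga_step(p, fitness)
        history.append(p)
    return history


if __name__ == "__main__":
    # Hypothetical f-values for illustration only (not those used in the paper).
    f_values = {g: 2.0 + 0.1 * sum(g) for g in product((0, 1), repeat=3)}
    final = ipsga_dynamics(f_values, 30)[-1]
    print({"".join(map(str, g)): round(v, 4) for g, v in final.items()})
```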

10.2 Experiments 

For a number of different parameter regimes we now compare the dynamics of the genomes in $\mathfrak{B}_o$ under the action of IPSGA2 with the dynamics of the genomes in $\mathfrak{B}_o$ under the action of SFSGA. In each experiment each genome $g \in \mathfrak{B}_o$ is associated with some value $f_g$ chosen from the interval [2,3]. We call this value the f-value of that genome. The stochastic fitness function of SFSGA is such that whenever $g$ is generated it assigns to $g$ a fitness value sampled from the distribution $\mathcal{N}(f_g, \sigma^2)$ (where $\mathcal{N}(m, n)$ denotes the normal distribution with mean $m$ and variance $n$). If the sample is less than 0 then the stochastic fitness function returns the value 0. The standard deviation $\sigma$ is 0.8 in all our experiments. The fitness function of IPSGA2 simply maps the genome $g$ to the value $f_g$. The maximum number of generations in all experiments is 30. Each experiment consists of a single run of IPSGA2 and $r$ runs of SFSGA. For each genome, the frequency of that genome in each generation is averaged across all runs and plotted. For values of $r$ greater than 1, the error bars in the plots give the standard deviation of the frequency of the genome in each generation. In each experiment the size of the population of SFSGA is denoted by $N$.
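The experimental setup for SFSGA can be sketched as a short simulation. This is an illustrative reconstruction under assumptions, not the code used to produce the figures: the function names are hypothetical, each mating is assumed to produce one child, parents are drawn with replacement, and samples below zero are truncated to zero as described above.

```python
import random
from itertools import product


def uniform_crossover(y, z, rng):
    """Build a child by copying each bit from one of the two parents w.p. 1/2."""
    return tuple(rng.choice((yi, zi)) for yi, zi in zip(y, z))


def frequencies(population, genomes):
    """Relative frequency of every genome in the population."""
    counts = {g: 0 for g in genomes}
    for g in population:
        counts[g] += 1
    return {g: counts[g] / len(population) for g in genomes}


def run_sfsga(f_values, pop_size, generations, sigma=0.8, seed=0):
    """Simulate a finite-population SGA with a stochastic fitness function.

    f_values maps each genome (a tuple of bits) to its f-value. Returns a list
    of frequency dictionaries, one per generation, starting with generation 0.
    """
    rng = random.Random(seed)
    genomes = list(f_values)
    # Initial population drawn uniformly at random from the genome set.
    population = [rng.choice(genomes) for _ in range(pop_size)]
    history = [frequencies(population, genomes)]
    for _ in range(generations):
        # Noisy fitness: a fresh Gaussian sample per individual, truncated at 0.
        weights = [max(0.0, rng.gauss(f_values[g], sigma)) for g in population]
        # Fitness proportional selection of two parents per child.
        parents = rng.choices(population, weights=weights, k=2 * pop_size)
        population = [uniform_crossover(parents[2 * i], parents[2 * i + 1], rng)
                      for i in range(pop_size)]
        history.append(frequencies(population, genomes))
    return history


if __name__ == "__main__":
    rng = random.Random(42)
    # f-values drawn from [2, 3] as in the experiments; the seed is arbitrary.
    f_values = {g: rng.uniform(2.0, 3.0) for g in product((0, 1), repeat=3)}
    final = run_sfsga(f_values, pop_size=2000, generations=30)[-1]
    print({"".join(map(str, g)): round(v, 3) for g, v in final.items()})
```

Averaging the returned histories over several seeds gives the mean and standard deviation per generation that the error bars in the plots describe.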

10.2.1 Experiment 1 

In the first experiment $o = 3$, $N = 2000$, and $r = 1$. Figure 3 shows the (frequency) dynamics of the genomes in $\mathfrak{B}_3$ under the action of IPSGA2 and SFSGA.

10.2.2 Experiment 2 

The next experiment builds confidence that the similarity between the dynamics of the genomes under the action of SFSGA and the dynamics of the genomes under the action of IPSGA2 in experiment 1 was no fluke. Figure 4 shows the result of this experiment in which we raise the value of $r$ to 40.

10.2.3 Experiment 3 

Figure 5 shows the results of our third experiment in which we increase the value of $N$ to 20000. Note how the error bars are much smaller than those in experiment 2.

10.2.4 Experiment 4

The error bars are smaller still in Figure 6, in which the consequence of further increasing the value of $N$ to 100000 can be seen. From this experiment and the previous two, one can infer that as the population size of SFSGA gets larger the dynamics of the genomes under the action of SFSGA more closely mirror the dynamics of the genomes under the action of IPSGA2.



10.2.5 Experiment 5 

Experiment 5 validates this inference. We set $r$ back to 1, and set $N$ to 400000. Figure 7 shows the result of this experiment. Note the close match between the dynamics of the genomes under the action of IPSGA2 and SFSGA.



10.2.6 Experiment 6 

In the final experiment we show that there is nothing special about $o = 3$. Figure 8 shows the results of an experiment in which $o = 4$, $N = 200000$ and $r = 10$.



11. Rescuing the SGA From a Perceived Limitation

We can use the model in the previous section (SFSGA) to debunk a common misconception about SGAs which, in our opinion, will be retrospectively judged to be a significant barrier to the discovery of an accurate theory of adaptation for the SGA. It is widely assumed that SGAs are incapable of increasing the frequency of a low-order schema with above average fitness when the defining bits of that schema are widely dispersed (Goldberg et al., 1989; Goldberg, 2002). We now argue that this assumption is false. Our argument is independent of the theoretical work in sections 2 to 9; however, it does rely on the uncontroversial modeling decision that we made in section 10.1.

Consider an SGA with long genomes. For a concrete example let us say that the genomes are of length $2 \times 10^9 + 1$. Suppose that selection is fitness proportional, crossover is uniform, the population size of the SGA is 1000, the initial population is always drawn from a uniform distribution over the genome space, and fitness is low-variance schematically distributed with respect to some family of schemata $\mathcal{F}$ of order 3. Suppose the defining bits of the schemata are in positions 1, $10^9 + 1$, and $2 \times 10^9 + 1$. Let this SGA play the role of SGA in the discussion in section 10. Then as described in section 10 the frequency dynamics of the schemata in $\mathcal{F}$ under the action of this SGA can be modeled by the frequency dynamics of the genomes in $\mathfrak{B}_3$ under the action of an SGA with a stochastic fitness function, SFSGA.

Figures 9 to 14 present the results of a series of six experiments in which we plot the frequency dynamics of the genomes in $\mathfrak{B}_3$ under the action of SFSGA (with uniform crossover, fitness proportional selection and population size 1000). Each experiment uses a different assignment of f-values to the genomes in $\mathfrak{B}_3$. In all experiments except the last one, the f-values are randomly chosen from the interval [2,3] and assigned to the genomes in $\mathfrak{B}_3$. From the results it can be seen that in each case the genome with the highest frequency at the end of 300 generations is one with an above average f-value. However it should be noted that this genome isn't always the one with the highest f-value (see experiments 10 and 12). From these results it can be concluded that SGA as described in the preceding paragraph can effect an increase in the frequency of a low-order schema with above average fitness even when the defining bits of that schema are widely dispersed (in the example we are considering, the defining bits are almost a billion loci apart!).

[Figure 3 plots not reproduced. Recoverable panel titles: genome 000, f = 2.8639; genome 001, f = 2.9062; genome 010, f = 2.2001; genome 011, f = 2.8594.]

Figure 3: Results of Experiment 1 ($o = 3$, $N = 2000$, $r = 1$). A series of eight plots showing the dynamics of the eight genomes in $\mathfrak{B}_3$ under the action of IPSGA2 and SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the frequency of a genome in a population. The title of each plot displays the genome and its f-value. In each plot the thick light grey line shows the frequency dynamics of the genome under the action of IPSGA2. The thin black line shows the frequency dynamics of the genome under the action of SFSGA.

[Figure 4 plots not reproduced. Recoverable panel titles: genome 000, f = 2.8639; genome 001, f = 2.9062; genome 010, f = 2.2001.]

Figure 4: Results of Experiment 2 ($o = 3$, $N = 2000$, $r = 40$). A series of eight plots showing the dynamics of the eight genomes in $\mathfrak{B}_3$ under the action of IPSGA2 and SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the frequency of a genome in a population. The title of each plot displays the genome and its f-value. In each plot the thick light grey line shows the frequency dynamics of the genome under the action of IPSGA2. The thin black line shows the average frequency dynamics of the genome under the action of SFSGA over 40 runs. The error bars show one standard deviation above and below the average in each generation.

[Figure 5 plots not reproduced. Recoverable panel titles: genome 000, f = 2.8639; genome 001, f = 2.9062; genome 010, f = 2.2001.]

Figure 5: Results of Experiment 3 ($o = 3$, $N = 20000$, $r = 40$). A series of eight plots showing the dynamics of the eight genomes in $\mathfrak{B}_3$ under the action of IPSGA2 and SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the frequency of a genome in a population. The title of each plot displays the genome and its f-value. In each plot the thick light grey line shows the frequency dynamics of the genome under the action of IPSGA2. The thin black line shows the average frequency dynamics of the genome under the action of SFSGA over 40 runs. The error bars show one standard deviation above and below the average in each generation.

[Figure 6 plots not reproduced. Recoverable panel titles: genome 000, f = 2.8639; genome 001, f = 2.9062; genome 010, f = 2.2001; genome 011, f = 2.8594.]

Figure 6: Results of Experiment 4 ($o = 3$, $N = 100000$, $r = 40$). A series of eight plots showing the dynamics of the eight genomes in $\mathfrak{B}_3$ under the action of IPSGA2 and SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the frequency of a genome in a population. The title of each plot displays the genome and its f-value. In each plot the thick light grey line shows the frequency dynamics of the genome under the action of IPSGA2. The thin black line shows the average frequency dynamics of the genome under the action of SFSGA over 40 runs. The error bars show one standard deviation above and below the average in each generation.

Figure 7: Results of Experiment 5 ($o = 3$, $N = 400000$, $r = 1$). A series of eight plots showing the dynamics of the eight genomes in $\mathfrak{B}_3$ under the action of IPSGA2 and SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the frequency of a genome in a population. The title of each plot displays the genome and its f-value. In each plot the thick light grey line shows the frequency dynamics of the genome under the action of IPSGA2. The thin black line shows the frequency dynamics of the genome under the action of SFSGA.

Figure 8: Results of Experiment 6 ($o = 4$, $N = 200000$, $r = 10$). A series of sixteen plots showing the dynamics of the sixteen genomes in $\mathfrak{B}_4$ under the action of IPSGA2 and SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the frequency of a genome in a population. The title of each plot displays the genome and its f-value. In each plot the thick light grey line shows the frequency dynamics of the genome under the action of IPSGA2. The thin black line shows the average frequency dynamics of the genome under the action of SFSGA over 10 runs. The error bars show one standard deviation above and below the average in each generation.

Like many unfounded assumptions about the SGA, the assumption that it is incapable of performing the feat described above originated in Holland's seminal work (Holland, 1975), and subsequently received unqualified support in the popular textbook by Goldberg (1989). Overcoming this perceived limitation is the raison d'être for at least two new adaptation algorithms: messy GA (Goldberg et al., 1989; Goldberg, 2002) and LLGA (Harik and Goldberg, 1997; Goldberg, 2002). Furthermore the inventors of several other adaptation algorithms, including CGA (Harik et al., 1999), ECGA (Harik, 1999), FDA (Mühlenbein and Mahnig, 1999), LFDA (Mühlenbein and Mahnig, 2001), BOA (Pelikan et al., 1998; Goldberg, 2002), hBOA (Pelikan and Goldberg, 2001), and SEAM (Watson, 2002, 2006), have touted the superiority of their algorithms vis-a-vis the SGA on this matter. The simple experiments in this section show that on this matter, the SGA is more powerful than it is commonly thought to be.



12. Conclusion 

The biosphere is replete with organisms that are exquisitely well adapted to the environ- 
mental niches they inhabit. Natural sexual evolution has been crucial to the generation 
of what are arguably the most highly adapted of these organisms — cheetahs, owls, hu- 
mans etc. A deeply intriguing idea is that we can build adaptation algorithms which, at an 
abstract level, mimic the behavior of natural sexual evolution, and in doing so, "harness" 
something of the adaptive power of this incredibly effective process. 

But what is the abstract level at which natural sexual evolution should be mimicked? 
In other words given everything we know about natural sexual evolutionary systems, how 
do we distinguish between aspects of these systems which are essential to their adaptive 
power, and those which can be viewed as "mere biological detail" and need not be simulated, 
especially when taking a first swipe at building an artificial evolutionary system which
harnesses the power of natural sexual evolution? For instance is it necessary for such
"first-order" artificial evolutionary systems to simulate hydrogen bonding between the bases of
DNA strands? What about diploidy, or the way genomes of organisms are comprised of 
multiple chromosomes? And how crucial is the fact that crossover takes place between 
homologous chromosomes during meiosis? Engineers of yesteryear faced a similar quandary
when trying to ascertain just what it is about birds that gives them their capacity for flight.
Stories of inventors in feathered suits jumping to their deaths off cliffs and buildings bear
testament to the fact that our initial answers to such questions are often incorrect. It
was only after the realization that birds were "using" Bernoulli's principle to stay aloft that 
researchers began to make any real progress towards building machines that successfully 
harnessed the principle underlying birds' capacity for soaring. One can infer the following 
general rule from this example: without a good understanding of exactly why a natural 
system exhibits a certain useful phenomenon, efforts to build artificial systems that exhibit 
the same phenomenon by mimicking the natural system will be misguided and are unlikely 
to be successful. 

Figure 9: Results of Experiment 7 ($o = 3$, $N = 1000$, $r = 10$). A series of eight plots showing the dynamics of the eight genomes in $\mathfrak{B}_3$ under the action of SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the average frequency of a genome in a population averaged over 10 trials. The title of each plot displays the genome and its f-value. The thin black line plots the average frequency dynamics of the genome under the action of SFSGA at every tenth generation. The error bars show one standard deviation above and below the average. The average f-value of the genomes in this experiment is 2.6545.

Figure 10: Results of Experiment 8 ($o = 3$, $N = 1000$, $r = 10$). A series of eight plots showing the dynamics of the eight genomes in $\mathfrak{B}_3$ under the action of SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the average frequency of a genome in a population averaged over 10 trials. The title of each plot displays the genome and its f-value. The thin black line plots the average frequency dynamics of the genome under the action of SFSGA at every tenth generation. The error bars show one standard deviation above and below the average. The average f-value of the genomes in this experiment is 2.1367.

Figure 11: Results of Experiment 9 ($o = 3$, $N = 1000$, $r = 10$). A series of eight plots showing the dynamics of the eight genomes in $\mathfrak{B}_3$ under the action of SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the average frequency of a genome in a population averaged over 10 trials. The title of each plot displays the genome and its f-value. The thin black line plots the average frequency dynamics of the genome under the action of SFSGA at every tenth generation. The error bars show one standard deviation above and below the average. The average f-value of the genomes in this experiment is 2.4885.

Figure 12: Results of Experiment 10 ($o = 3$, $N = 1000$, $r = 10$). A series of eight plots showing the dynamics of the eight genomes in $\mathfrak{B}_3$ under the action of SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the average frequency of a genome in a population averaged over 10 trials. The title of each plot displays the genome and its f-value. The thin black line plots the average frequency dynamics of the genome under the action of SFSGA at every tenth generation. The error bars show one standard deviation above and below the average. The average f-value of the genomes in this experiment is 2.4537.

Figure 13: Results of Experiment 11 ($o = 3$, $N = 1000$, $r = 10$). A series of eight plots showing the dynamics of the eight genomes in $\mathfrak{B}_3$ under the action of SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the average frequency of a genome in a population averaged over 10 trials. The title of each plot displays the genome and its f-value. The thin black line plots the average frequency dynamics of the genome under the action of SFSGA at every tenth generation. The error bars show one standard deviation above and below the average. The average f-value of the genomes in this experiment is 2.3625.

Figure 14: Results of Experiment 12 ($o = 3$, $N = 1000$, $r = 10$). A series of eight plots showing the dynamics of the eight genomes in $\mathfrak{B}_3$ under the action of SFSGA. The independent axis in each plot shows the generation number, and the dependent axis gives the average frequency of a genome in a population averaged over 10 trials. The title of each plot displays the genome and its f-value. The thin black line plots the average frequency dynamics of the genome under the action of SFSGA at every tenth generation. The error bars show one standard deviation above and below the average. The average f-value of the genomes in this experiment is 2.6545.






The field of Population Genetics stems from the efforts of its founders (Fisher, 
Wright and Haldane) to reconcile Darwin's theory of adaptation by natural selection 
with Mendel's theory of genetics. Odd as it may now seem, these two theories were once 
thought to contradict each other (see Okasha). The literature of this field seems like 
the most reasonable place to look for answers to questions about how and why adaptation 
occurs in natural sexually reproducing populations. Unfortunately Population Genetics 
does not hold ready answers to these questions. Indeed the differing theories of Fisher and 
Wright regarding exactly this issue have been the subject of a longstanding and ongoing 
debate (Wade and Goodnight, 1998; Brodie III, 2000). Significant empirical evidence has 
been gathered by both sides in support of their respective positions, yet a definitive answer 
has not emerged. The absence of a definitive answer makes it difficult to make principled 
decisions about the level of detail which must be present in an artificial system which seeks, 
through mimicry, to harness something of the adaptive power of natural sexual evolution. 

Let us return to the analogy with the field of aviation that we introduced above. 
For the sake of argument let us suppose that in the age before the discovery of Bernoulli's 
principle someone had, for some reason or the other, succeeded in inventing a simple winged 
machine which a) mimicked birds at some relatively abstract level, and b) was capable of 
soaring long distances (even if slowly, or inefficiently). Such a machine would immediately be 
incredibly interesting because when compared to the complex body of a bird, such a machine 
would be much more amenable to analysis. The principle underlying this machine's ability to 
soar, once derived, could then be used to build "better" soaring machines. This underlying 
principle would also play a very important part in the development of an accurate theory 
of why birds can soar⁴. The implications of this vignette for the importance of the SGA to 
the fields of Evolutionary Computation and Evolutionary Biology should be evident. The 
SGA should thus be viewed as a 'lucky break', one that can and should be exploited for its 
potential to advance theories and applications of the adaptive capacity of sexual evolution. 

Let us spell out the importance of studying the SGA's capacity for adaptation. As 
models of sexual evolutionary systems go, the SGA is arguably the simplest one which has 
regularly been observed to adapt high-quality solutions despite the almost certain presence 
of non-trivial epistasis between genomic loci, in other words, despite its application to 
problems whose representations are in all likelihood riddled with local optima. Because 
of its effectiveness in spite of its simplicity, the SGA is a model of sexual evolution that 
is highly likely to a) yield an explanation for the incredible adaptive capacity of sexual 
evolution and b) precipitate the identification of classes of non-trivial epistasis which do 
not pose much of a problem for sexual evolution. The SGA is of course not the last word in 
evolution-inspired adaptive systems. Efforts to extend this algorithm in ways that "increase 
its adaptive power" should be, and have been, made. However if, while attempting to extend 
the SGA, one works within a flawed paradigm, one is unlikely to capitalize on, and may 
even compromise, whatever "power" the SGA derives, by virtue of imitation, from natural 
sexual evolution. A non-dogmatic study of how SGAs perform adaptation has the potential 
to yield a theory which accurately explains the reasons behind the SGA's frequent success. 
Such a theory will probably usher in a new paradigm within which fruitful research into 
the construction of more "powerful" evolutionary algorithms can proceed. If such a theory 
differs significantly from those of Fisher and Wright it is likely to have deep implications 
for the field of Population Genetics and the larger field of Evolutionary Biology (within 
which several basic questions have yet to receive satisfactory answers: Why sex? Why 
punctuated equilibrium? Why diploidy and polyploidy? Why speciation? What is the unit 
of selection?). Finally, a study that reveals the SGA's capacity for adaptation will 
also undoubtedly reveal classes of problems that SGAs can efficiently solve. These classes 
of problems may prove useful to machine learning researchers in their efforts to find 
semi-principled reductions of difficult learning problems to problems for which robust and 
efficient solvers exist. 

4. If the inventor of the winged machine gives an incorrect reason for why his machine can soar, that 
would probably slow down the progression described above, but it would take away nothing from the 
importance of the winged machine itself. 

This paper makes two concrete contributions. Firstly, in sections 2-10 we have derived 
results which we believe are relevant to the riddle of the SGA's capacity for adaptation. 
Secondly, in section 11 we have presented results which show that a commonly perceived 
shortcoming of the SGA is illusory. The following two paragraphs expand upon these contributions. 

As we discussed in section 1.2, a promising way to understand the effect of selection 
and variation on the composition of the evolving population of an SGA is by understanding 
the multi-generational effect of these operations on the search distribution of the SGA. One 
way to study an evolving high-dimensional distribution is to study its evolving multivariate 
marginals. In this paper we derive conditions under which a multivariate marginal of the 
search distribution of an SGA, with an infinite population of long genomes, can be 
approximated over multiple generations. In other words we derive conditions under which the 
frequency dynamics of some family of schemata under the action of an SGA, with an infinite 
population of long genomes, can be approximated over multiple generations. The conditions 
we derived in this paper are much weaker than those derived by Wright et al. (2003), which 
makes our result more useful. The conclusions of section 9 are reached by making 
small leaps of intuition. In section 10 we experimentally validated these conclusions. Our 
validation, though indirect, is based on assumptions and modeling decisions which are, in 
our opinion, uncontroversial. 
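
To make the correspondence between multivariate marginals and schema frequencies concrete, the following is a minimal sketch, not taken from the paper. It assumes a search distribution stored explicitly as a dictionary from genomes to probabilities, which is only feasible for short genomes; the function name `marginal` and the example distribution are hypothetical.

```python
from collections import defaultdict

def marginal(dist, loci):
    """Marginalize a distribution over genomes onto the given loci.

    dist maps genome tuples, e.g. (0, 1, 1), to probabilities. The result maps
    each assignment of bits at `loci` to the total frequency of the schema that
    fixes those bits and leaves every other locus unspecified.
    """
    out = defaultdict(float)
    for genome, prob in dist.items():
        key = tuple(genome[i] for i in loci)
        out[key] += prob
    return dict(out)

# Hypothetical distribution over genomes of length 3 (illustrative values only).
dist = {(0, 0, 0): 0.1, (0, 1, 1): 0.2, (1, 0, 1): 0.3, (1, 1, 1): 0.4}

# Marginal over loci 0 and 2: the frequencies of the schemata 0*0, 0*1, 1*0, 1*1.
m = marginal(dist, loci=(0, 2))
for bits, freq in sorted(m.items()):
    print(bits, round(freq, 3))
```

Each entry of the marginal is the frequency of one schema in the family defined by the chosen loci, so tracking the marginal over generations is the same as tracking the frequency dynamics of that family.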

Besides being incorrectly used to support an outlandish hypothesis about what the 
SGA can do (hierarchical building block assembly), Holland's Schema Theorem has also 
heavily shaped opinion about what the SGA cannot do. Following the experiments by 
Mitchell et al. (1992) and Forrest and Mitchell (1993), the perceived abilities of the SGA 
stand compromised, yet the perceived limitations of the SGA have remained unchanged. 
The SGA is currently thought to be incapable of increasing the frequency of a low-order 
schema with higher than average fitness when the defining length of that schema is large 
(Goldberg et al., 1989; Goldberg, 2002). In section 11 we argued that this perception is 
simply wrong: an SGA can increase the frequency of a low-order schema with higher than 
average fitness even when the defining length of that schema is large. 
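
The claim above can be probed with a toy simulation. The sketch below is not the experiment of section 11 and does not reproduce SFSGA; it assumes a finite-population GA with fitness-proportional selection, one-point crossover and bitwise mutation, and a hypothetical fitness function that rewards membership in the order-2 schema whose defining bits sit at the two ends of the genome (the largest possible defining length). All names and parameter values are illustrative.

```python
import random

def schema_frequency_history(pop_size=200, genome_len=40, generations=30,
                             bonus=1.0, mutation_rate=0.005, seed=0):
    """Return the per-generation frequency of the schema 1**...**1."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]

    def in_schema(genome):
        # Defining bits are the first and last loci.
        return genome[0] == 1 and genome[-1] == 1

    def frequency(population):
        return sum(in_schema(g) for g in population) / len(population)

    history = [frequency(pop)]
    for _ in range(generations):
        # Assumed fitness: 1 + bonus inside the schema, 1 outside it.
        weights = [1.0 + bonus if in_schema(g) else 1.0 for g in pop]
        # Fitness-proportional selection of two parents per child.
        parents = rng.choices(pop, weights=weights, k=2 * pop_size)
        next_pop = []
        for i in range(pop_size):
            mother, father = parents[2 * i], parents[2 * i + 1]
            # One-point crossover; every cut separates the two defining bits.
            cut = rng.randint(1, genome_len - 1)
            child = mother[:cut] + father[cut:]
            # Bitwise mutation.
            child = [b ^ 1 if rng.random() < mutation_rate else b
                     for b in child]
            next_pop.append(child)
        pop = next_pop
        history.append(frequency(pop))
    return history

history = schema_frequency_history()
print(f"initial: {history[0]:.3f}, final: {history[-1]:.3f}")
```

In this setup every crossover event breaks up the schema in the child, yet its frequency can still rise because each defining bit on its own carries above-average marginal fitness and spreads through the population.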

In closing we briefly mention that we have recently obtained a new, simple, and (in 
our opinion) satisfying theory which explains the SGA's remarkable capacity for adaptation. 
We have also identified a class of hard statistical problems such that a) the problems in 
this class can be solved efficiently and robustly by an SGA, and b) this class is likely 
to be a useful target of machine learning reductions. All of this will soon appear in our 
forthcoming dissertation. Our theory relies crucially on the SGA's ability to increase the 
frequency of schemata of low order and above-average fitness, even when the defining bits 
of those schemata are widely dispersed. 
Appendix 

Lemma 27 For any finite set $X$ and any metric space $(\Upsilon, d)$, let $A : \Upsilon \to \Lambda^X$ and let $B : X \to [\Upsilon \to [0,1]]$ be functions⁵ such that for any $h \in \Upsilon$ and any $x \in X$, $(B(x))(h) = (A(h))(x)$. For any $h^* \in \Upsilon$, if the following statement is true 

$$\forall x \in X,\ \forall \epsilon_x > 0,\ \exists \delta_x > 0,\ \forall h \in \Upsilon,\quad d(h, h^*) < \delta_x \Rightarrow |(B(x))(h) - (B(x))(h^*)| < \epsilon_x$$

then we have that 

$$\forall \epsilon > 0,\ \exists \delta > 0,\ \forall h \in \Upsilon,\quad d(h, h^*) < \delta \Rightarrow d(A(h), A(h^*)) < \epsilon$$

This lemma says that $A$ is continuous at $h^*$ if for all $x \in X$, $B(x)$ is continuous at $h^*$. 

5. For any sets X, Y we use the notation [X → Y] to denote the set of all functions from X to Y. 

Proof: We first prove the following two claims. 
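Claim 1 

$$\forall x \in X \text{ s.t. } (B(x))(h^*) > 0,\ \forall \epsilon_x > 0,\ \exists \delta_x > 0,\ \forall h \in \Upsilon,$$
$$d(h, h^*) < \delta_x \Rightarrow |(B(x))(h) - (B(x))(h^*)| < \epsilon_x \cdot (B(x))(h^*)$$

This claim follows from the continuity of $B(x)$ at $h^*$ for all $x \in X$ and the fact that $(B(x))(h^*)$ is a 
positive constant w.r.t. $h$. 

Claim 2 For all $h \in \Upsilon$, 

$$\sum_{\substack{x \in X \text{ s.t.} \\ (A(h^*))(x) > (A(h))(x)}} |(A(h^*))(x) - (A(h))(x)| \;=\; \sum_{\substack{x \in X \text{ s.t.} \\ (A(h))(x) > (A(h^*))(x)}} |(A(h))(x) - (A(h^*))(x)|$$

The proof of this claim is as follows: for all $h \in \Upsilon$, 

\begin{align*}
& \sum_{x \in X} \big[(A(h^*))(x) - (A(h))(x)\big] = 0 \\
\Rightarrow\ & \sum_{\substack{x \in X \text{ s.t.} \\ (A(h^*))(x) > (A(h))(x)}} \big[(A(h^*))(x) - (A(h))(x)\big] \;-\; \sum_{\substack{x \in X \text{ s.t.} \\ (A(h))(x) > (A(h^*))(x)}} \big[(A(h))(x) - (A(h^*))(x)\big] = 0 \\
\Rightarrow\ & \sum_{\substack{x \in X \text{ s.t.} \\ (A(h^*))(x) > (A(h))(x)}} \big[(A(h^*))(x) - (A(h))(x)\big] \;=\; \sum_{\substack{x \in X \text{ s.t.} \\ (A(h))(x) > (A(h^*))(x)}} \big[(A(h))(x) - (A(h^*))(x)\big] \\
\Rightarrow\ & \Bigg| \sum_{\substack{x \in X \text{ s.t.} \\ (A(h^*))(x) > (A(h))(x)}} \big[(A(h^*))(x) - (A(h))(x)\big] \Bigg| \;=\; \Bigg| \sum_{\substack{x \in X \text{ s.t.} \\ (A(h))(x) > (A(h^*))(x)}} \big[(A(h))(x) - (A(h^*))(x)\big] \Bigg| \\
\Rightarrow\ & \sum_{\substack{x \in X \text{ s.t.} \\ (A(h^*))(x) > (A(h))(x)}} |(A(h^*))(x) - (A(h))(x)| \;=\; \sum_{\substack{x \in X \text{ s.t.} \\ (A(h))(x) > (A(h^*))(x)}} |(A(h))(x) - (A(h^*))(x)|
\end{align*}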
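We now prove the lemma. Using Claim 1 and the fact that $X$ is finite, we get that $\forall \epsilon > 0$, $\exists \delta > 0$, 
$\forall h \in \Upsilon$ such that $d(h, h^*) < \delta$, 

\begin{align*}
& \sum_{\substack{x \in X \text{ s.t.} \\ (A(h^*))(x) > (A(h))(x)}} |(B(x))(h^*) - (B(x))(h)| \;<\; \sum_{\substack{x \in X \text{ s.t.} \\ (A(h^*))(x) > (A(h))(x)}} \frac{\epsilon}{2}\,(B(x))(h^*) \\
\Rightarrow\ & \sum_{\substack{x \in X \text{ s.t.} \\ (A(h^*))(x) > (A(h))(x)}} |(A(h^*))(x) - (A(h))(x)| \;<\; \frac{\epsilon}{2} \sum_{\substack{x \in X \text{ s.t.} \\ (A(h^*))(x) > (A(h))(x)}} (A(h^*))(x) \\
\Rightarrow\ & \sum_{\substack{x \in X \text{ s.t.} \\ (A(h^*))(x) > (A(h))(x)}} |(A(h^*))(x) - (A(h))(x)| \;<\; \frac{\epsilon}{2}
\end{align*}

By Claim 2 and the result above, we have that $\forall \epsilon > 0$, $\exists \delta > 0$, $\forall h \in \Upsilon$ such that $d(h, h^*) < \delta$, 

$$\sum_{\substack{x \in X \text{ s.t.} \\ (A(h))(x) > (A(h^*))(x)}} |(A(h))(x) - (A(h^*))(x)| \;<\; \frac{\epsilon}{2}$$

Therefore, given the two previous results, we have that $\forall \epsilon > 0$, $\exists \delta > 0$, $\forall h \in \Upsilon$ such that 
$d(h, h^*) < \delta$, 

$$\sum_{x \in X} |(A(h))(x) - (A(h^*))(x)| \;<\; \epsilon \qquad \blacksquare$$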
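
Claim 2 can be checked numerically on a small example. The sketch below is an illustration, not part of the proof: it takes two hypothetical distributions over a three-element set, standing in for $A(h^*)$ and $A(h)$, and uses exact rational arithmetic to confirm that the mass by which the first exceeds the second equals the mass by which the second exceeds the first.

```python
from fractions import Fraction as F

def one_sided_gaps(p, q):
    """Return the total excess of p over q and the total excess of q over p."""
    above = sum(p[x] - q[x] for x in p if p[x] > q[x])
    below = sum(q[x] - p[x] for x in p if q[x] > p[x])
    return above, below

# Two hypothetical distributions over X = {a, b, c}; both sum to 1.
p = {"a": F(1, 2), "b": F(1, 3), "c": F(1, 6)}
q = {"a": F(1, 4), "b": F(1, 4), "c": F(1, 2)}

above, below = one_sided_gaps(p, q)
l1 = sum(abs(p[x] - q[x]) for x in p)
print(above, below, l1)
```

Because the two one-sided sums agree, bounding either one by half of a target value bounds the full sum of absolute differences by that target, which is how the two halves are combined in the final step of the proof.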



Lemma 28 Let $X$ be a finite set, and let $T$ be a transmission function. Then for any $p' \in \Lambda^X$ 
and any $\epsilon > 0$, there exists a $\delta > 0$ such that for any $p \in \Lambda^X$, 

$$d(p, p') < \delta \Rightarrow d(V_T\, p,\ V_T\, p') < \epsilon$$

Sketch of Proof: Let $A : \Lambda^X \to \Lambda^X$ be defined such that $(A(p))(x) = (V_T\, p)(x)$. Let 
$B : X \to [\Lambda^X \to [0,1]]$ be defined such that $(B(x))(p) = (V_T\, p)(x)$. The reader can check that for 
any $x \in X$, $B(x)$ is a continuous function. The application of Lemma 27 completes the proof. 



By similar arguments, we obtain the following two lemmas. 

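Lemma 29 Let $X$ be a finite set, and let $f : X \to \mathbb{R}^+$ be a function. Then for any $p' \in \Lambda^X$ and 
any $\epsilon > 0$, there exists a $\delta > 0$ such that for any $p \in \Lambda^X$, 

$$d(p, p') < \delta \Rightarrow d(S_f\, p,\ S_f\, p') < \epsilon$$

Lemma 30 Let $X$ be a finite set, and let $p \in \Lambda^X$ be a distribution. Then for any $f' \in [X \to \mathbb{R}^+]$, 
and any $\epsilon > 0$, there exists a $\delta > 0$ such that for any $f \in [X \to \mathbb{R}^+]$, 

$$d(f, f') < \delta \Rightarrow d(S_f\, p,\ S_{f'}\, p) < \epsilon$$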
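
The continuity asserted by Lemma 29 can be illustrated numerically. The sketch below is an illustration under stated assumptions, not the paper's definitions: it assumes that the selection operator is the usual fitness-proportional one, $(S_f\, p)(x) = f(x)p(x) / \sum_y f(y)p(y)$, and that distance between distributions is measured by the sum of absolute differences. The fitness values, the reference distribution and the helper names are hypothetical.

```python
def select(p, f):
    """Assumed fitness-proportional selection operator S_f applied to p."""
    mean_fitness = sum(f[x] * p[x] for x in p)
    return {x: f[x] * p[x] / mean_fitness for x in p}

def l1(p, q):
    """Sum of absolute differences between two distributions on the same set."""
    return sum(abs(p[x] - q[x]) for x in p)

def perturb(p, eps):
    """Move eps of probability mass from genome '00' to genome '11'."""
    q = dict(p)
    q["00"] -= eps
    q["11"] += eps
    return q

# Hypothetical strictly positive fitness function and reference distribution.
f = {"00": 1.0, "01": 2.0, "10": 2.0, "11": 4.0}
p_ref = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}

for eps in (0.1, 0.01, 0.001):
    q = perturb(p_ref, eps)
    before = l1(p_ref, q)
    after = l1(select(p_ref, f), select(q, f))
    print(f"eps={eps}: d(p, p')={before:.4f}, d(S_f p, S_f p')={after:.4f}")
```

As the perturbation shrinks, the distance between the selected distributions shrinks with it, which is the behaviour the lemma guarantees for any strictly positive fitness function on a finite set.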